System, method, and computer software product for genotype determination using probe array data

ABSTRACT

An embodiment of a method of analyzing data from processed images of biological probe arrays is described that comprises receiving a plurality of files comprising a plurality of intensity values associated with a probe on a biological probe array; normalizing the intensity values in each of the data files; determining an initial assignment for a plurality of genotypes using one or more of the intensity values from each file for each assignment; estimating a distribution of cluster centers using the plurality of initial assignments; combining the normalized intensity values with the cluster centers to determine a posterior estimate for each cluster center; and assigning a plurality of genotype calls using a distance of the one or more intensity values from the posterior estimate.

RELATED APPLICATIONS

This application is a divisional of U.S. application Ser. No. 13/468,604filed on May 10, 2012 which is a continuation of Ser. No. 12/123,463filed on May 19, 2008, now U.S. Pat. No. 8,200,440 which claims priorityto U.S. Provisional Application No. 60/938,757, filed May 18, 2007. Thedisclosures of each of these applications are hereby incorporated hereinby reference in their entirety for all purposes.

BACKGROUND Field of the Invention

The present invention relates to systems and methods for processing datausing information gained from examining biological material. Inparticular, a preferred embodiment of the invention relates to analysisof processed image data from scanned biological probe arrays for thepurpose of determining genotype information via identification of SingleNucleotide Polymorphisms (referred to as SNPs).

Related Art

Synthesized nucleic acid probe arrays, such as Affymetrix GENECHIP®probe arrays, and spotted probe arrays, have been used to generateunprecedented amounts of information about biological systems. Forexample, the GENECHIP® Mapping 500K Array Set available from Affymetrix,Inc. of Santa Clara, Calif., is comprised of two microarrays capable ofgenotyping on average 250,000 SNPs per array. Newer arrays developed byAffymetrix can contain probes sufficient to genotype up to one millionSNPs per array. Analysis of genotype data from such microarrays may leadto the development of new drugs and new diagnostic tools.

SUMMARY OF THE INVENTION

Systems, methods, and products to address these and other needs aredescribed herein with respect to illustrative, non-limiting,implementations. Various alternatives, modifications and equivalents arepossible. For example, certain systems, methods, and computer softwareproducts are described herein using exemplary implementations foranalyzing data from arrays of biological materials made by spotting orother methods such as photolithography or bead based systems. However,these systems, methods, and products may be applied with respect to manyother types of probe arrays and, more generally, with respect tonumerous parallel biological assays produced in accordance with otherconventional technologies and/or produced in accordance with techniquesthat may be developed in the future. For example, the systems, methods,and products described herein may be applied to parallel assays ofnucleic acids, PCR products generated from cDNA clones, proteins,antibodies, or many other biological materials. These materials may bedisposed on slides (as typically used for spotted arrays), on substratesemployed for GENECHIP® arrays, or on beads, optical fibers, or othersubstrates or media, which may include polymeric coatings or otherlayers on top of slides or other substrates. Moreover, the probes neednot be immobilized in or on a substrate, and, if immobilized, need notbe disposed in regular patterns or arrays. For convenience, the term“probe array” will generally be used broadly hereafter to refer to allof these types of arrays and parallel biological assays.

An embodiment of a method of analyzing data from processed images ofbiological probe arrays is described that comprises receiving aplurality of files comprising a plurality of intensity values associatedwith a probe on a biological probe array; normalizing the intensityvalues in each of the data files; determining an initial assignment fora plurality of genotypes using one or more of the intensity values fromeach file for each assignment; estimating a distribution of clustercenters using the plurality of initial assignments; combining thenormalized intensity values with the cluster centers to determine aposterior estimate for each cluster center; and assigning a plurality ofgenotype calls using a distance of the one or more intensity values fromthe posterior estimate.

The above embodiments and implementations are not necessarily inclusiveor exclusive of each other and may be combined in any manner that isnon-conflicting and otherwise possible, whether they be presented inassociation with a same, or a different, embodiment or implementation.The description of one embodiment or implementation is not intended tobe limiting with respect to other embodiments and/or implementations.Also, any one or more function, step, operation, or technique describedelsewhere in this specification may, in alternative implementations, becombined with any one or more function, step, operation, or techniquedescribed in the summary. Thus, the above embodiment and implementationsare illustrative rather than limiting.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and further features will be more clearly appreciated from thefollowing detailed description when taken in conjunction with theaccompanying drawings. In the drawings, like reference numerals indicatelike structures or method steps and the leftmost digit of a referencenumeral indicates the number of the figure in which the referencedelement first appears (for example, the element 160 appears first inFIG. 1). In functional block diagrams, rectangles generally indicatefunctional elements and parallelograms generally indicate data. Inmethod flow charts, rectangles generally indicate method steps anddiamond shapes generally indicate decision elements. All of theseconventions, however, are intended to be typical or illustrative, ratherthan limiting.

FIG. 1 is a functional block diagram of one embodiment of a computer anda server enabled to communicate over a network, as well as a probe arrayand probe array instruments;

FIG. 2 is a functional block diagram of one embodiment of the computersystem of FIG. 1, including a display device that presents a graphicaluser interface to a user;

FIG. 3 is a functional block diagram of one embodiment of the server ofFIG. 1, where the server comprises an executable instrument control andimage analysis application; and

FIG. 4 is a functional block diagram of one embodiment of the instrumentcontrol and image analysis application of FIG. 3 comprising an analysisapplication that receives process image files from the instrumentcontrol and image analysis application of FIG. 3 for additionalanalysis.

FIG. 5 is a workflow diagram for a preferred embodiment.

FIG. 6 shows a method of the present invention which calls genotypes byonly using the “contrast” values for each data point.

FIG. 7 shows a method for dividing data into trial genotypes.

FIG. 8 shows dividing lines for trial genotype seeds.

FIG. 9 shows the use of a prior to infer a missing cluster.

FIG. 10 shows that one method tries all (n+1)(n+2)/2 possible divisionsof the data as trial genotype assignments, and the fit is evaluated bylog likelihood of data.

FIG. 11 shows genotype confidence calls.

FIG. 12 shows splitting of two clusters into three.

FIG. 13 shows clustering space transformation.

FIG. 14 shows an example of Cluster Center Stretch transformation.

FIG. 15 shows an example division of simulated data.

FIG. 16 shows another example of dividing the data.

FIG. 17 shows a division of data that includes no AA genotypes.

FIG. 18 shows the performance of BRLMM-P on HapMap samples.

DETAILED DESCRIPTION

Highly accurate and reliable genotype calling is an essential componentof high-density SNP genotyping technology. Rabbee and Speed recentlydeveloped a model called the Robust Linear Model with Mahalanobisdistance classifier (RLMM, pronounced ‘realm’) (See Nusrat Rabbee andTerence P. Speed, “A genotype calling algorithm for Afjymetrix SNParrays” UC Berkeley Statistics Online Tech Reports, August 2005, herebyincorporated by reference in its entirety). We present here an extensionof the RLMM model developed for a commercial nucleic acid array productwhich improves overall performance (call rates and accuracy) and thisextension only requires probes hybridizing to the SNP alleles (theperfect match probes) and does not require use of mis-matched probes,unlike BRLMM, a variation of RLMM that includes a Bayesian step whichprovides improved estimates of cluster centers and variances. The modeldisclosed herein is called BRLMM-P. Bayesian probability is aninterpretation of probability suggested by Bayesian theory, which holdsthat the concept of probability can be defined as the degree to which aperson believes a proposition. Bayesian theory also suggests that Bayes'theorem can be used as a rule to infer or update the degree of belief inlight of new information. There are further differences that aredisclosed below.

Additionally, one advantage is that RLMM performs a multiple chipanalysis, enabling the simultaneous estimation of probe effects andallele signals for each SNP. Accounting for probe specific effectsresults in lower variance on allele signal estimates. Also, anotheradvantage is the estimation of genotypes by a multiple-sampleclassification. It integrates information as necessary from existing,known SNPs to better predict the properties of the underlying clusterscorresponding to the BB, AB, and AA genotypes. The present algorithm,based on the above RLMM model, makes weaker assumptions about thebehavior of probe intensities than does some other algorithms, making itfar more robust in the presence of real-world data.

a) General

The present invention has many preferred embodiments and relies on manypatents, applications and other references for details known to those ofthe art. Therefore, when a patent, application, or other reference iscited or repeated below, it should be understood that it is incorporatedby reference in its entirety for all purposes as well as for theproposition that is recited.

As used in this application, the singular form “a,” “an,” and “the”include plural references unless the context clearly dictates otherwise.For example, the term “an agent” includes a plurality of agents,including mixtures thereof.

An individual is not limited to a human being but may also be otherorganisms including but not limited to mammals, plants, bacteria, orcells derived from any of the above.

Throughout this disclosure, various aspects of this invention can bepresented in a range format. It should be understood that thedescription in range format is merely for convenience and brevity andshould not be construed as an inflexible limitation on the scope of theinvention. Accordingly, the description of a range should be consideredto have specifically disclosed all the possible subranges as well asindividual numerical values within that range. For example, descriptionof a range such as from 1 to 6 should be considered to have specificallydisclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numberswithin that range, for example, 1, 2, 3, 4, 5, and 6. This appliesregardless of the breadth of the range.

The practice of the present invention may employ, unless otherwiseindicated, conventional techniques and descriptions of organicchemistry, polymer technology, molecular biology (including recombinanttechniques), cell biology, biochemistry, and immunology, which arewithin the skill of the art. Such conventional techniques includepolymer array synthesis, hybridization, ligation, and detection ofhybridization using a label. Specific illustrations of suitabletechniques can be had by reference to the example herein below. However,other equivalent conventional procedures can, of course, also be used.Such conventional techniques and descriptions can be found in standardlaboratory manuals such as Genome Analysis: A Laboratory Manual Series(Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A LaboratoryManual, PCR Primer: A Laboratory Manual, and Molecular Cloning: ALaboratory Manual (all from Cold Spring Harbor Laboratory Press),Stryer, L. (1995) Biochemistry (4th Ed.) Freeman, New York, Gait,“Oligonucleotide Synthesis: A Practical Approach” 1984, IRL Press,London, Nelson and Cox (2000), Lehninger, Principles of Biochemistry 3rdEd., W.H. Freeman Pub., New York, N.Y. and Berg et al. (2002)Biochemistry, 5th Ed., W.H. Freeman Pub., New York, N.Y., all of whichare herein incorporated in their entirety by reference for all purposes.

The present invention can employ solid substrates, including arrays insome preferred embodiments. Methods and techniques applicable to polymer(including protein) array synthesis have been described in U.S. Ser. No.09/536,841, WO 00/58516, U.S. Pat. Nos. 5,143,854, 5,242,974, 5,252,743,5,324,633, 5,384,261, 5,405,783, 5,424,186, 5,451,683, 5,482,867,5,491,074, 5,527,681, 5,550,215, 5,571,639, 5,578,832, 5,593,839,5,599,695, 5,624,711, 5,631,734, 5,795,716, 5,831,070, 5,837,832,5,856,101, 5,858,659, 5,936,324, 5,945,334, 5,968,740, 5,974,164,5,981,185, 5,981,956, 6,025,601, 6,033,860, 6,040,193, 6,090,555,6,136,269, 6,269,846 and 6,428,752, in PCT Applications Nos.PCT/US99/00730 (International Publication Number WO 99/36760) andPCT/USOI/04285 (International Publication Number WO 01/58593), which areall incorporated herein by reference in their entirety for all purposes.

Patents that describe synthesis techniques in specific embodimentsinclude U.S. Pat. Nos. 5,412,087, 6,147,205, 6,262,216, 6,310,189,5,889,165, and 5,959,098. Nucleic acid arrays are described in many ofthe above patents, but the same techniques are applied to polypeptidearrays. Nucleic acid arrays that are useful in the present inventioninclude those that are commercially available from Affymetrix (SantaClara, Calif.) under the brand name GENECHIP®. Example arrays are shownon the website at affymetrix.com.

The present invention also contemplates many uses for polymers attachedto solid substrates. These uses include gene expression monitoring,profiling, library screening, genotyping and diagnostics. Geneexpression monitoring and profiling methods can be shown in U.S. Pat.Nos. 5,800,992, 6,013,449, 6,020,135, 6,033,860, 6,040,138, 6,177,248and 6,309,822. Genotyping and uses therefore are shown in U.S. Ser. Nos.10/442,021, 10/013,598 (U.S. PGPub Nos. 20030036069), and U.S. Pat. Nos.5,856,092, 6,300,063, 5,858,659, 6,284,460, 6,361,947, 6,368,799 and6,333,179. Other uses are embodied in U.S. Pat. Nos. 5,871,928,5,902,723, 6,045,996, 5,541,061, and 6,197,506.

Methods for conducting polynucleotide hybridization assays have beenwell developed in the art. Hybridization assay procedures and conditionswill vary depending on the application and are selected in accordancewith the general binding methods known including those referred to in:Maniatis et al. Molecular Cloning: A Laboratory Manual (2^(nd) Ed. ColdSpring Harbor, N.Y, 1989); Berger and Kimmel Methods in Enzymology, Vol.152, Guide to Molecular Cloning Techniques (Academic Press, Inc., SanDiego, Calif., 1987); Young and Davism, P.N.A.S, 80: 1194 (1983).Methods and apparatus for carrying out repeated and controlledhybridization reactions have been described in U.S. Pat. Nos. 5,871,928,5,874,219, 6,045,996 and 6,386,749, 6,391,623 each of which areincorporated herein by reference.

Methods and apparatus for signal detection and processing of intensitydata are disclosed in, for example, U.S. Pat. Nos. 5,143,854, 5,547,839,5,578,832, 5,631,734, 5,800,992, 5,834,758; 5,856,092, 5,902,723,5,936,324, 5,981,956, 6,025,601, 6,090,555, 6,141,096, 6,185,030,6,201,639; 6,218,803; and 6,225,625, in U.S. Ser. Nos. 10/389,194,10/913,102, 10/846,261, 11/260,617 and in PCT Application PCT/US99/06097(published as WO99/47964), each of which also is hereby incorporated byreference in its entirety for all purposes.

The practice of the present invention may also employ conventionalbiology methods, software and systems. Computer software products of theinvention typically include computer readable medium havingcomputer-executable instructions for performing the logic steps of themethod of the invention. Suitable computer readable medium includefloppy disk, CD-ROM/DVD/DVD-ROM, hard-disk drive, flash memory, ROM/RAM,magnetic tapes and etc. The computer executable instructions may bewritten in a suitable computer language or combination of severallanguages. Basic computational biology methods are described in, e.g.Setubal and Meidanis et al., Introduction to Computational BiologyMethods (PWS Publishing Company, Boston, 1997); Salzberg, Searles,Kasif, (Ed.), Computational Methods in Molecular Biology, (Elsevier,Amsterdam, 1998); Rashidi and Buehler, Bioinformatics Basics:Application in Biological Science and Medicine (CRC Press, London, 2000)and Ouelette and Bzevanis Bioinformatics: A Practical Guide for Analysisof Gene and Proteins (Wiley & Sons, Inc., 2nd ed., 2001). See U.S. Pat.No. 6,420,108.

The present invention may also make use of various computer programproducts and software for a variety of purposes, such as probe design,management of data, analysis, and instrument operation. See, U.S. Pat.Nos. 5,593,839, 5,795,716, 5,733,729, 5,974,164, 6,066,454, 6,090,555,6,185,561, 6,188,783, 6,223,127, 6,229,911 and 6,308,170.

Additionally, the present invention may have preferred embodiments thatinclude methods for providing genetic information over networks such asthe Internet as shown in U.S. Ser. Nos. 10/197,621, 10/063,559 (UnitedStates Publication No. 20020183936), Ser. Nos. 10/065,856, 10/065,868,10/328,818, 10/328,872, 10/423,403, and 60/482,389.

b) Definitions

The term “admixture” refers to the phenomenon of gene flow betweenpopulations resulting from migration. Admixture can create linkagedisequilibrium (LD).

The term “allele’ as used herein is any one of a number of alternativeforms a given locus (position) on a chromosome. An allele may be used toindicate one form of a polymorphism, for example, a biallelic SNP mayhave possible alleles A and B. An allele may also be used to indicate aparticular combination of alleles of two or more SNPs in a given gene orchromosomal segment. The frequency of an allele in a population is thenumber of times that specific allele appears divided by the total numberof alleles of that locus.

The term “array” as used herein refers to an intentionally createdcollection of molecules which can be prepared either synthetically orbiosynthetically. The molecules in the array can be identical ordifferent from each other. The array can assume a variety of formats,for example, libraries of soluble molecules; libraries of compoundstethered to resin beads, silica chips, or other solid supports.

The term “complementary” as used herein refers to the hybridization orbase pairing between nucleotides or nucleic acids, such as, forinstance, between the two strands of a double stranded DNA molecule orbetween an oligonucleotide primer and a primer binding site on a singlestranded nucleic acid to be sequenced or amplified. Complementarynucleotides are, generally, A and T (or A and U), or C and G. Two singlestranded RNA or DNA molecules are said to be complementary when thenucleotides of one strand, optimally aligned and compared and withappropriate nucleotide insertions or deletions, pair with at least about80% of the nucleotides of the other strand, usually at least about 90%to 95%, and more preferably from about 98 to 100%. Alternatively,complementarity exists when an RNA or DNA strand will hybridize underselective hybridization conditions to its complement. Typically,selective hybridization will occur when there is at least about 65%complementary over a stretch of at least 14 to 25 nucleotides,preferably at least about 75%, more preferably at least about 90%complementary. See, M. Kanehisa Nucleic Acids Res. 12:203 (1984),incorporated herein by reference.

The term “genome” as used herein is all the genetic material in thechromosomes of an organism. DNA derived from the genetic material in thechromosomes of a particular organism is genomic DNA. A genomic libraryis a collection of clones made from a set of randomly generatedoverlapping DNA fragments representing the entire genome of an organism.

The term “genotype” as used herein refers to the genetic information anindividual carries at one or more positions in the genome. A genotypemay refer to the information present at a single polymorphism, forexample, a single SNP. For example, if a SNP is biallelic and can beeither an A or a C then if an individual is homozygous for A at thatposition the genotype of the SNP is homozygous A or AA. Genotype mayalso refer to the information present at a plurality of polymorphicpositions.

The term “Hardy-Weinberg equilibrium” (HWE) as used herein refers to theprinciple that an allele that when homozygous leads to a disorder thatprevents the individual from reproducing does not disappear from thepopulation but remains present in a population in the undetectableheterozygous state at a constant allele frequency.

The term “hybridization” as used herein refers to the process in whichtwo single-stranded polynucleotides bind non-covalently to form a stabledouble-stranded polynucleotide; triple-stranded hybridization is alsotheoretically possible. The resulting (usually) double-strandedpolynucleotide is a “hybrid.” The proportion of the population ofpolynucleotides that forms stable hybrids is referred to herein as the“degree of hybridization.” Hybridizations are usually performed understringent conditions, for example, at a salt concentration of no morethan about 1 M and a temperature of at least 25° C. For example,conditions of 5×SSPE (750 mM NaCl, 50 mM NaPhosphate, 5 mM EDTA, pH 7.4)and a temperature of 25-30° C. are suitable for allele-specific probehybridizations or conditions of 100 mM MES, 1 M [Na+], 20 mM EDTA, 0.01%Tween-20 and a temperature of 30-50° C., preferably at about 45-50° C.Hybridizations may be performed in the presence of agents such asherring sperm DNA at about 0.1 mg/ml, acetylated BSA at about 0.5 mg/ml.As other factors may affect the stringency of hybridization, includingbase composition and length of the complementary strands, presence oforganic solvents and extent of base mismatching, the combination ofparameters is more important than the absolute measure of any one alone.Hybridization conditions suitable for microarrays are described in theGene Expression Technical Manual, 2004 and the GENECHIP® Mapping AssayManual, 2004.

The term “linkage analysis” as used herein refers to a method of geneticanalysis in which data are collected from affected families, and regionsof the genome are identified that co-segregated with the disease in manyindependent families or over many generations of an extended pedigree. Adisease locus may be identified because it lies in a region of thegenome that is shared by all affected members of a pedigree.

The term “linkage disequilibrium” or sometimes referred to as “allelicassociation” as used herein refers to the preferential association of aparticular allele or genetic marker with a specific allele, or geneticmarker at a nearby chromosomal location more frequently than expected bychance for any particular allele frequency in the population. Forexample, if locus X has alleles A and B, which occur equally frequently,and linked locus Y has alleles C and D, which occur equally frequently,one would expect the combination AC to occur with a frequency of 0.25.If AC occurs more frequently, then alleles A and Care in linkagedisequilibrium. Linkage disequilibrium may result from natural selectionof certain combination of alleles or because an allele has beenintroduced into a population too recently to have reached equilibriumwith linked alleles. The genetic interval around a disease locus may benarrowed by detecting disequilibrium between nearby markers and thedisease locus. For additional information on linkage disequilibrium seeArdlie et al., Nat. Rev. Gen. 3:299-309, 2002.

The term “lod score” or “LOD” is the log of the odds ratio of theprobability of the data occurring under the specific hypothesis relativeto the null hypothesis. LOD=log [probability assuminglinkage/probability assuming no linkage].

The terms “mismatch” and “perfect match” describe the relationshipbetween the sequence of the intended target and the probe that is on anarray. A perfect match probe is designed to exactly match the intendedtarget sequence. The mismatch is designed to have at least one base thatis not part of the intended target. A mismatch probe is a probe that isdesigned to be complementary to a reference sequence except for somemismatches that may significantly affect the hybridization between theprobe and its target sequence. In preferred embodiments, mismatch probesare designed to be complementary to a reference sequence except for ahomomeric base mismatch at the central (e.g., 13th in a 25 base probe)position. Mismatch probes are normally used as controls forcross-hybridization. A probe pair is usually composed of a perfect matchand its corresponding mismatch probe. In preferred embodiments, thedifference between perfect match and mismatch provides an intensitydifference in a probe pair. The array that is preferred in the presentinvention contains all perfect match probes and does not includemismatch probes for target sequences.

The term “oligonucleotide” or sometimes refer by “polynucleotide” asused herein refers to a nucleic acid ranging from at least 2, preferableat least 8, and more preferably at least 20 nucleotides in length or acompound that specifically hybridizes to a polynucleotide.Polynucleotides of the present invention include sequences ofdeoxyribonucleic acid (DNA) or ribonucleic acid (RNA) which may beisolated from natural sources, recombinantly produced or artificiallysynthesized and mimetics thereof. A further example of a polynucleotideof the present invention may be peptide nucleic acid (PNA). Theinvention also encompasses situations in which there is a nontraditionalbase pairing such as Hoogsteen base pairing which has been identified incertain tRNA molecules and postulated to exist in a triple helix.“Polynucleotide” and “oligonucleotide” are used interchangeably in thisapplication.

The term “polymorphism” as used herein refers to the occurrence of twoor more genetically determined alternative sequences or alleles in apopulation. A polymorphic marker or site is the locus at whichdivergence occurs. Preferred markers have at least two alleles, eachoccurring at frequency of greater than 1%, and more preferably greaterthan 10% or 20% of a selected population. A polymorphism may compriseone or more base changes, an insertion, a repeat, or a deletion. Apolymorphic locus may be as small as one base pair. Polymorphic markersinclude restriction fragment length polymorphisms, variable number oftandem repeats (VNTR's), hypervariable regions, minisatellites,dinucleotide repeats, trinucleotide repeats, tetranucleotide repeats,simple sequence repeats, and insertion elements such as Alu. The firstidentified allelic form is arbitrarily designated as the reference formand other allelic forms are designated as alternative or variantalleles. The allelic form occurring most frequently in a selectedpopulation is sometimes referred to as the wildtype form. Diploidorganisms may be homozygous or heterozygous for allelic forms. Adiallelic polymorphism has two forms. A triallelic polymorphism hasthree forms. Single nucleotide polymorphisms (SNPs) are included inpolymorphisms.

The term “primer” as used herein refers to a single-strandedoligonucleotide capable of acting as a point of initiation fortemplate-directed DNA synthesis under suitable conditions for example,buffer and temperature, in the presence of four different nucleosidetriphosphates and an agent for polymerization, such as, for example, DNAor RNA polymerase or reverse transcriptase. The length of the primer, inany given case, depends on, for example, the intended use of the primer,and generally ranges from 15 to 30 nucleotides. Short primer moleculesgenerally require cooler temperatures to form sufficiently stable hybridcomplexes with the template. A primer need not reflect the exactsequence of the template but must be sufficiently complementary tohybridize with such template. The primer site is the area of thetemplate to which a primer hybridizes. The primer pair is a set ofprimers including a 5′ upstream primer that hybridizes with the 5′ endof the sequence to be amplified and a 3′ downstream primer thathybridizes with the complement of the 3′ end of the sequence to beamplified.

The term “prior” as used as a noun herein refers to an estimate of aparameter plus the uncertainty in the distribution of that parameterthat is entered into the calculation before any (current) data isobserved. This is standard notation in Bayesian statistics. Such valuesas estimates for genotype cluster center locations and variances can beused as prior values (such as ones obtained from other data sets or userentered quantities).

The term “probe” as used herein refers to a surface-immobilized moleculethat can be recognized by a particular target. See U.S. Pat. No.6,582,908 for an example of arrays having all possible combinations ofprobes with 10, 12, and more bases. Examples of probes that can beinvestigated by this invention include, but are not restricted to,agonists and antagonists for cell membrane receptors, toxins and venoms,viral epitopes, hormones (for example, opioid peptides, steroids, etc.),hormone receptors, peptides, enzymes, enzyme substrates, cofactors,drugs, lectins, sugars, oligonucleotides, nucleic acids,oligosaccharides, proteins, and monoclonal antibodies. The array that ispreferred in the present invention contains all perfect match probes anddoes not include mismatch probes for target sequences.

c) Embodiments of the Present Invention

Embodiments of an image analysis system comprising an image analysis andinstrument control application are described herein that provide aflexible and dynamically configurable architecture and a low level ofcomplexity. In particular, embodiments are described that provide filemanagement functionality where each file comprises a unique identifierand logical relationships between the files using those identifiers.Further, the embodiments include a modular architecture for customizingcomponents and functionality to meet individual needs as well as userinterfaces provided over a network that provide a less restrictiveworkflow environment.

Probe Array 140: An illustrative example of probe array 140 is providedin FIGS. 1, 2, and 3. Descriptions of probe arrays are provided abovewith respect to “Nucleic Acid Probe arrays” and other relateddisclosure. In various implementations, probe array 140 may be disposedin a cartridge or housing. Examples of probe arrays and associatedcartridges or housings may be found in U.S. Pat. Nos. 5,945,334,6,287,850, 6,399,365, 6,551,817, each of which is also herebyincorporated by reference herein in its entirety for all purposes. Inaddition, some embodiments of probe array 140 may be associated withpegs or posts, where for instance probe array 140 may be affixed viagluing, welding, or other means known in the related art to the peg orpost that may be operatively coupled to a tray, strip or other type ofsimilar substrate. Examples with embodiments of probe array 140associated with pegs or posts may be found in U.S. patent Ser. No.10/826,577.

Scanner 100: Labeled targets hybridized to probe arrays may be detectedusing various devices, sometimes referred to as scanners, as describedabove with respect to methods and apparatus for signal detection.

An illustrative device is shown in FIG. 1 as scanner 100. For example,scanners image the targets by detecting fluorescent or other emissionsfrom labels associated with target molecules, or by detectingtransmitted, reflected, or scattered radiation. A typical scheme employsoptical and other elements to provide excitation light and toselectively collect the emissions.

For example, scanner 100 provides a signal representing the intensities(and possibly other characteristics, such as color that may beassociated with a detected wavelength) of the detected emissions orreflected wavelengths of light, as well as the locations on thesubstrate where the emissions or reflected wavelengths were detected.Typically, the signal includes intensity information corresponding toelemental sub-areas of the scanned substrate. The term “elemental” inthis context means that the intensities, and/or other characteristics,of the emissions or reflected wavelengths from this area each arerepresented by a single value. When displayed as an image for viewing orprocessing, elemental picture elements, or pixels, often represent thisinformation. Thus, in the present example, a pixel may have a singlevalue representing the intensity of the elemental sub-area of thesubstrate from which the emissions or reflected wavelengths werescanned. The pixel may also have another value representing anothercharacteristic, such as color, positive or negative image, or other typeof image representation. The size of a pixel may vary in differentembodiments and could include a 2.5 μm, 1.5 μm, 1.0 μm, or sub-micronpixel size. Two examples where the signal may be incorporated into dataare data files in the form *.dat or *.tif as generated respectively byAffymetrix Microarray Suite (described in U.S. Pat. No. 7,031,846) basedon images scanned from GENECHIP® arrays. Examples of scanner systemsthat may be implemented with embodiments of the present inventioninclude U.S. patent application Ser. No. 10/389,194 now U.S. Pat. No.7,689,022, Ser. No. 10/846,261 now U.S. Pat. No. 7,148,492, Ser. No.10/913,102 now U.S. Pat. No. 7,317,415, and Ser. No. 11/260,617 now U.S.Pat. No. 7,682,782, each of which are incorporated by reference above.

Autoloader 110: Illustrated in FIG. 1 is autoloader 110 that is anexample of one possible embodiment of an automatic loader that providestransport of one or more probe arrays 140 used in conjunction withscanner 100 and fluid handling system 115.

In some embodiments, autoloader 110 may include a number of componentssuch as, for instance, a magazine, tray, carousel, or other means ofholding and/or storing a plurality of probe arrays; a transportassembly; and a thermal control chamber. For example, someimplementations of autoloader 110 may include features for preservingthe biological integrity of the probe arrays for extended periods suchas, for instance, a period of up to sixteen hours. Also in the presentexample, in the event of a power failure or error condition thatprevents scanning or other processing steps, autoloader 110 willindicate the failure to user 101 and maintain storage temperature forall probe arrays 140 through the use of what may be referred to as anuninterruptable power supply system. The power failure or other errormay be communicated to user 101 by one or more methods that couldinclude audible/visual alarm indicators, a graphical user interface,automated paging system, alert via a graphical user interface providedby instrument control and image analysis applications 372, or othermeans of automated communication. Still continuing with the presentexample, the power supply system could also support one or more othersystems such as scanner 100 or fluid handling system 115.

Some embodiments of autoloader 110 may include pre-heating eachembodiment of probe array 140 to a preferred temperature prior to orduring particular processing or image acquisition operations. Forexample, autoloader 110 may employ a thermally controlled chamber topre-heat one or more probe arrays 140 to the same temperature as theinternal environment of scanner 100 prior to transport to the scanner.Similarly, autoloader 110 could bring probe array 140 to the appropriatehybridization temperature prior to loading into fluid handling system115. Also in the present example, autoloader 110 may also employ one ormore thermal control operations as post-processing steps such as whenautoloader 110 removes each of probe arrays 140 from scanner 100,autoloader 110 may employ one or more environmental or temperaturecontrol elements to warm or cool the probe array to a preferredtemperature in order to preserve biological integrity.

Many embodiments of autoloader 110 are enabled to provide automatedloading/unloading of probe arrays 140 to both fluid handling system 115and/or scanner 100. Also, some embodiments of autoloader 110 may beequipped with a barcode reader, or other means of identification andinformation storage such as, for instance, magnetic strips, what arereferred to by those of ordinary skill in the related art as radiofrequency identification (RFID), or one or more microchips associatedwith each embodiment of probe array 140. For example, autoloader 110 mayread or otherwise identify encoded information from the means ofidentification and information storage that in the present example mayinclude a barcode associated with probe array 140. Autoloader 110 mayuse the information and/or identifier directly in one or more operationsor alternatively may forward the information and/or identifier toinstrument control and image analysis applications 372 of server 120 forprocessing, where applications 372 may then provide instruction toautoloader 110 based, at least in part, upon the processed informationand/or identifier. Also in some implementations, scanner 100 and/orfluid handling system 115 may also be similarly equipped with a barcodereader or other means as described above.

Additional examples of autoloaders and probe array storage instrumentsare described in U.S. patent application Ser. Nos. 10/389,194 and10/684,160; and U.S. Pat. Nos. 6,511,277 and 6,604,902 each of which ishereby incorporated herein by reference in their entireties for allpurposes.

Fluid Handling System 115: Embodiments of fluid handling system 115, asillustrated in FIG. 1, may implement one or more procedures oroperations for hybridizing one or more experimental samples to probesassociated with one or more probe arrays 140, as well as operationsthat, for instance, may include exposing each of probe arrays 140 towashes, buffers, stains, or other fluids in a sequential or parallelfashion. Some embodiments of the present invention may include probearray 140 enclosed in a housing or cartridge that may be placed in acarousel, tray, or other means of holding for transport or processing aspreviously described with respect to autoloader 110. For example, acarousel, tray, or carrier may be specifically enabled to register aplurality of probe array 140/housing embodiments in a specificorientation and may enable or improve high throughput processing of eachof the plurality of probe arrays 140 by providing positive positionalregistration so that the robotic instrument may carry out processingsteps in an efficient and repeatable fashion. Additional examples of afluid handling system that interacts with various implementations ofprobe array 140/housing embodiments is described in U.S. patentapplication Ser. No. 11/057,320, which is hereby incorporated byreference herein in its entirety for all purposes.

Embodiments of fluid handling system 115 could include a plurality ofelements enabled to automatically introduce and remove fluids from aprobe array 140 without user intervention such as, for instance, one ormore sample holders, fluid transfer devices, and fluid reservoirs. Forexample, applications 372 may direct fluid handling system 115 to add aspecified volume of a particular sample to an associated implementationof probe array 140. In the present example, fluid handling system 115removes the specified volume of sample from a reservoir positioned in asample holder via one of sample transfer pins, pipettes or pipette tips,specialized adaptors, or other means known to those of ordinary skill inthe related art. In some embodiments, the sample holder may be thermallycontrolled in order to maintain the integrity of the samples, reagents,or fluids contained in the reservoirs, for a preferred temperatureaccording to a specific protocol or processing step, or for temperatureconsistency of the various fluids exposed to probe array 140. The term“reservoir” as used herein could include a vial, tube, bottle, 96 or 384well plate, or some other container suitable for holding volumes ofliquid. Also in the present example, fluid handling system 115 mayemploy a vacuum/pressure source, valves, and means for fluid transportknown to those of ordinary skill in the related art.

In some embodiments, fluid handling system 115 may interface with eachof one or more of probe arrays 140 by moving a fluid transfer devicesuch as, for instance, what may be referred to as a pin or needle suchas a dual lumen needle, pipette tip, specialized adaptor or other typeof fluid transfer device known in the art. For example, as those ofordinary skill in the related art will appreciate, a plurality of fluidtransfer devices such as a robotic device comprising a pipettorcomponent coupled to one or more pipette tips may be employed to engagewith one or more of interfaces or alternatively direct fluid to anexposed surface, in order to process one or more of probe arrays 140,where a plurality of probe arrays 140 may be processed in parallel. Inthe present example, fluid handling system 115 may simultaneously or ina sequential fashion process a plurality of probe arrays 140 by removinga specified aliquot of sample or other type of fluid from each reservoirdisposed in one or more sample holders and deliver each sample or fluidto probe array 140.

Fluid handling system 115 may remove used sample or waste fluids fromprobe array 140 by, for instance, creating a negative pressure or vacuumthrough one or more ports associated with a housing. Alternatively,fluids may be similarly expelled using a positive pressure of air, gas,or other type of fluid either alone or in combination with the negativepressure, through one or more ports where the positive pressure maycause the undesired fluid to be expelled through one or more channels oraway from an exposed surface. Expelled of removed fluids may be storedin one or more reservoir or alternatively may be expelled from fluidhandling system 115 into another waste receptacle or drain. For example,it may be desirable in some implementations for user 101 to recover asample from probe array 140 and store the recovered sample in anenvironmentally controlled receptacle in order to preserve thebiological integrity.

As those of ordinary skill in the related art will appreciate, thesample content of each reservoir within a sample holder is known so thatapplications 372 may associate an experimental sample or fluid with aparticular embodiment of probe array 140. Fluid handling system 115 mayalso provide one or more detectors associated with the sample holder toindicate to applications 372 when a reservoir is present or absent.Additionally, fluid handling system 115 may include one or moreimplementations of a barcode reader, or other means of identificationdescribed above with respect to autoloader 110, enabled to identify eachreservoir using an associated barcode identifier or other type ofmachine readable identifier.

Some embodiments of fluid handling system 115 may include one or moredetection systems enabled to detect the presence and identity of a fluidassociated with probe array 140. Also, some embodiments of fluidhandling system 115 may provide an environment that promotes thehybridization of a biological target contained in a sample to the probesof the probe array. Some environmental conditions that affect thehybridization efficiency could include temperature, gas bubbles,agitation, oscillating fluid levels, or other conditions that couldpromote the hybridization of biological samples to probes. Otherenvironmental conditions that fluid handling system 115 may provide mayinclude a means to provide or improve mixing of fluids. For example ameans of shaking probe array 140 to promote inertial movement of fluidsand turbulent flow may include what is generally referred as a plateshaker, rotating carousel, or other shaking instrument. Other sources offluid mixing could be provided by an ultrasonic source or mechanicalsource such as for instance a piezo-electric agitation source, or othermeans of providing mechanical agitation. In the present example, theagitation or shaking means may provide fluidic movement that may improvethe efficiency of hybridization of target molecules in a sample to probearray 140. Other examples of elements and methods for mixing fluids in achamber are provided in U.S. patent application Ser. No. 11/017,095,titled “System and Method for Improved Hybridization Using EmbeddedResonant Mixing Elements”, filed Dec. 20, 2004 which is herebyincorporated by reference herein in its entirety for all purposes.

Embodiments of fluid handling system 115 may also perform what those ofordinary skill in the related art may refer to as post hybridizationoperations such as, for instance, washes with buffers or reagents,water, labels, or antibodies. For example, staining may includeintroducing a stain comprising molecules with fluorescent tags thatselectively bind to the biological molecules or targets that havehybridized to probe array 140. Additional post-hybridization operationsmay, for example, include the introduction of what is referred to as anon-stringent buffer to probe array 140 to preserve the integrity of thehybridized array.

Some implementations of fluid handling system 115 allow for interruptionof operations to insert or remove probe arrays, samples, reagents,buffers, or any other materials. After interruption, fluid handlingsystem 115 may conduct a scan of some or all identifiers associated withprobe arrays, samples, carousels, trays, or magazines, user inputidentifiers, or other identifiers used in an automated process. Forexample, user 101 may wish to interrupt the process conducted by fluidhandling system 115 to remove a tray of samples and insert a new tray.The interruption is communicated to user 101 by a variety of methods,and the user performs the desired tasks. User 101 inputs a command forthe resumption of the process that may begin with fluid handling system115 scanning all available barcode identifiers. Applications 372determines what has been changed, and makes the appropriate adjustmentsto procedures and protocols.

Fluid handling system 115 may also perform operations that do not actdirectly upon a probe array. Such functions could include the managementof fresh versus used reagents and buffers, experimental samples, orother materials utilized in hybridization operations. Additionally,fluid handling system 115 may include features for leak control andisolation from systems that may be sensitive to exposure to liquids. Forexample, a user may load a variety of experimental samples into fluidhandling system 115 that have unique experimental requirements. In thepresent example the samples may have barcode labels with uniqueidentifiers associated with them. The barcode labels could be scannedwith a hand held reader or alternatively fluid handling system 115 couldinclude a dedicated reader. Alternatively, other means of identificationcould be used as described above. The user may associate the identifierwith the sample and store the data into one or more data files. Thesample may also be associated with a specific probe array type that issimilarly stored.

Additional examples of hybridization and other type of probe arrayprocessing instruments are described in U.S. patent application Ser.Nos. 10/684,160 and 10/712,860, both of which are hereby incorporated byreference herein in their entireties for all purposes.

Computer 150: An illustrative example of computer 150 is provided inFIG. 1 and also in greater detail in FIG. 2. Computer 150 may be anytype of computer platform such as a workstation, a personal computer, aserver, or any other present or future computer. Computer 150 typicallyincludes known components such as a processor 255, an operating system260, system memory 270, memory storage devices 281, and input-outputcontrollers 275, input-output devices 240, and display devices 245.Display devices 245 may include display devices that provides visualinformation, this information typically may be logically and/orphysically organized as an array of pixels. A Graphical user interface(GUI) controller may also be included that may comprise any of a varietyof known or future software programs for providing graphical input andoutput interfaces such as for instance GUI's 246. For example, GUI's 246may provide one or more graphical representations to a user, such asuser 101, and also be enabled to process user inputs via GUI's 246 usingmeans of selection or input known to those of ordinary skill in therelated art.

It will be understood by those of ordinary skill in the relevant artthat there are many possible configurations of the components ofcomputer 150 and that some components that may typically be included incomputer 150 are not shown, such as cache memory, a data backup unit,and many other devices. Processor 255 may be a commercially availableprocessor or it may be one of other processors that are or will becomeavailable. Some embodiments of processor 255 may also include what arereferred to as Multi-core processors and/or be enabled to employparallel processing technology in a single or multi-core configuration.For example, a multi-core architecture typically comprises two or moreprocessor “execution cores”. In the present example each execution coremay perform as an independent processor that enables parallel executionof multiple threads. In addition, those of ordinary skill in the relatedwill appreciate that processor 255 may be configured in what isgenerally referred to as 32 or 64 bit architectures, or otherarchitectural configurations now known or that may be developed in thefuture.

Processor 255 executes operating system 260. Operating system 260interfaces with firmware and hardware in a well-known manner, andfacilitates processor 255 in coordinating and executing the functions ofvarious computer programs that may be written in a variety ofprogramming languages. Operating system 260, typically in cooperationwith processor 255, coordinates and executes functions of the othercomponents of computer 150. Operating system 260 also providesscheduling, input-output control, file and data management, memorymanagement, and communication control and related services, all inaccordance with known techniques.

System memory 270 may be any of a variety of known or future memorystorage devices. Examples include any commonly available random accessmemory (RAM), magnetic medium such as a resident hard disk or tape, anoptical medium such as a read and write compact disc, or other memorystorage device. Memory storage devices 281 may be any of a variety ofknown or future devices, including a compact disk drive, a tape drive, aremovable hard disk drive, USB or flash drive, or a diskette drive. Suchtypes of memory storage devices 281 typically read from, and/or writeto, a program storage medium (not shown) such as, respectively, acompact disk, magnetic tape, removable hard disk, USB or flash drive, orfloppy diskette. Any of these program storage media, or others now inuse or that may later be developed, may be considered a computer programproduct. As will be appreciated, these program storage media typicallystore a computer software program and/or data. Computer softwareprograms, also called computer control logic, typically are stored insystem memory 270 and/or the program storage device used in conjunctionwith memory storage device 281.

In some embodiments, a computer program product is described comprisinga computer usable medium having control logic (computer softwareprogram, including program code) stored therein. The control logic, whenexecuted by processor 255, causes processor 255 to perform functionsdescribed herein. In other embodiments, some functions are implementedprimarily in hardware using, for example, a hardware state machine.Implementation of the hardware state machine so as to perform thefunctions described herein will be apparent to those skilled in therelevant arts.

Input-output controllers 275 could include any of a variety of knowndevices for accepting and processing information from a user, whether ahuman or a machine, whether local or remote. Such devices include, forexample, modem cards, wireless cards, network interface cards, soundcards, or other types of controllers for any of a variety of known inputdevices. Output controllers of input-output controllers 275 couldinclude controllers for any of a variety of known display devices forpresenting information to a user, whether a human or a machine, whetherlocal or remote. In the illustrated embodiment, the functional elementsof computer 150 communicate with each other via system bus 290. Some ofthese communications may be accomplished in alternative embodimentsusing network or other types of remote communications.

As will be evident to those skilled in the relevant art, an instrumentcontrol and image processing application, such as for instance animplementation of instrument control and image processing applications372 illustrated in FIG. 3, if implemented in software, may be loadedinto and executed from system memory 270 and/or memory storage device281. All or portions of the instrument control and image processingapplications may also reside in a read-only memory or similar device ofmemory storage device 281, such devices not requiring that theinstrument control and image processing applications first be loadedthrough input-output controllers 275. It will be understood by thoseskilled in the relevant art that the instrument control and imageprocessing applications, or portions of it, may be loaded by processor255 in a known manner into system memory 270, or cache memory (notshown), or both, as advantageous for execution. Also illustrated in FIG.2 are library files 274, experiment data 277, and internet client 279stored in system memory 270. For example, experiment data 277 couldinclude data related to one or more experiments or assays such asexcitation wavelength ranges, emission wavelength ranges, extinctioncoefficients and/or associated excitation power level values, or othervalues associated with one or more fluorescent labels. Additionally,internet client 279 may include an application enabled to accesses aremote service on another computer using a network that may for instancecomprise what are generally referred to as “Web Browsers”. Also, in thesame or other embodiments internet client 279 may include, or could bean element of, specialized software applications enabled to accessremote information via a network such as network 125 such as, forinstance, the GENECHIP® Data Analysis Software (GDAS) package orChromosome Copy Number Tool (CNAT) both available from Affymetrix, Inc.of Santa Clara Calif. that are each enabled to access information fromremote sources, and in particular probe array annotation informationfrom the NETAFFX® web site hosted on one or more servers provided byAffymetrix, Inc.

Network 125 may include one or more of the many various types ofnetworks well known to those of ordinary skill in the art. For example,network 125 may include a local or wide area network that employs whatis commonly referred to as a TCP/IP protocol suite to communicate, thatmay include a network comprising a worldwide system of interconnectedcomputer networks that is commonly referred to as the internet, or couldalso include various intranet architectures. Those of ordinary skill inthe related arts will also appreciate that some users in networkedenvironments may prefer to employ what are generally referred to as“firewalls” (also sometimes referred to as Packet Filters, or BorderProtection Devices) to control information traffic to and from hardwareand/or software systems. For example, firewalls may comprise hardware orsoftware elements or some combination thereof and are typically designedto enforce security policies put in place by users, such as for instancenetwork administrators, etc.

Server 120: FIG. 1 shows a typical configuration of a server computerconnected to a workstation computer via a network that is illustrated infurther detail in FIG. 3. In some implementations any function ascribedto Server 120 may be carried out by one or more other computers, and/orthe functions may be performed in parallel by a group of computers.

Typically, server 120 is a network-server class of computer designed forservicing a number of workstations or other computer platforms over anetwork. However, server 120 may be any of a variety of types ofgeneral-purpose computers such as a personal computer, workstation, mainframe computer, or other computer platform now or later developed.Server 120 typically includes known components such as processor 355,operating system 360, system memory 370, memory storage devices 381, andinput-output controllers 378. It will be understood by those skilled inthe relevant art that there are many possible configurations of thecomponents of server 120 that may typically include cache memory, a databackup unit, and many other devices. Similarly, many hardware andassociated software or firmware components may be implemented in anetwork server. For example, components to implement one or morefirewalls to protect data and applications, uninterruptable powersupplies, LAN switches, web-server routing software, and many othercomponents. Those of ordinary skill in the art will readily appreciatehow these and other conventional components may be implemented.

Processor 355 may include multiple processors. Processor 355 executesoperating system 360. Some embodiments of processor 355 may also includewhat are referred to as Multi-core processors and/or be enabled toemploy parallel processing technology in a single or multi-coreconfiguration similar to that as described above with respect toprocessor 255. In addition, those of ordinary skill in the related willappreciate that processor 355 may be configured in what is generallyreferred to as 32 or 64 bit architectures, or other architecturalconfigurations now known or that may be developed in the future.

Operating system 360 interfaces with firmware and hardware in awell-known manner, and facilitates processor 355 in coordinating andexecuting the functions of various computer programs that may be writtenin a variety of programming languages. Operating system 360, typicallyin cooperation with the processor, coordinates and executes functions ofthe other components of server 120. Operating system 360 also providesscheduling, input-output control, file and data management, memorymanagement, and communication control and related services, all inaccordance with known techniques.

System memory 370 may be any of a variety of known or future memorystorage devices. Examples include any commonly available random accessmemory (RAM), magnetic medium such as a resident hard disk or tape, anoptical medium such as a read and write compact disc, or other memorystorage device. Memory storage device 381 may be any of a variety ofknown or future devices, including a compact disk drive, a tape drive, aremovable hard disk drive, USB or flash drive, or a diskette drive. Suchtypes of memory storage device typically read from, and/or write to, aprogram storage medium (not shown) such as, respectively, a compactdisk, magnetic tape, removable hard disk, USB or flash drive, or floppydiskette. Any of these program storage media, or others now in use orthat may later be developed, may be considered a computer programproduct. As will be appreciated, these program storage media typicallystore a computer software program and/or data. Computer softwareprograms, also called computer control logic, typically are stored inthe system memory and/or the program storage device used in conjunctionwith the memory storage device.

In some embodiments, a computer program product is described comprisinga computer usable medium having control logic (computer softwareprogram, including program code) stored therein. The control logic, whenexecuted by the processor, causes the processor to perform functionsdescribed herein. In other embodiments, some functions are implementedprimarily in hardware using, for example, a hardware state machine.Implementation of the hardware state machine so as to perform thefunctions described herein will be apparent to those skilled in therelevant arts.

Input-output controllers 375 could include any of a variety of knowndevices for accepting and processing information from a user, whether ahuman or a machine, whether local or remote. Such devices include, forexample, modem cards, network interface cards, sound cards, or othertypes of controllers for any of a variety of known input or outputdevices. In the illustrated embodiment, the functional elements ofserver 120 communicate with each other via system bus 390. Some of thesecommunications may be accomplished in alternative embodiments usingnetwork or other types of remote communications.

As will be evident to those skilled in the relevant art, a serverapplication if implemented in software, may be loaded into the systemmemory and/or the memory storage device through one of the inputdevices, such as instrument control and image processing applications372 described in greater detail below. All or portions of these loadedelements may also reside in a read-only memory or similar device of thememory storage device, such devices not requiring that the elementsfirst be loaded through the input devices. It will be understood bythose skilled in the relevant art that any of the loaded elements, orportions of them, may be loaded by the processor in a known manner intothe system memory, or cache memory (not shown), or both, as advantageousfor execution.

Instrument control and image processing applications 372: Instrumentcontrol and image processing applications 372 may comprise any of avariety of known or future image processing applications. Some examplesof known instrument control and image processing applications includethe Affymetrix Microarray Suite, and Affymetrix GENECHIP® OperatingSoftware (hereafter referred to as GCOS) applications. Typically,embodiments of applications 372 may be loaded into system memory 270and/or memory storage device 281 through one of input devices 240.

Some improved embodiments of applications 372 include executable codebeing stored in system memory 270, illustrated in FIG. 3 as instrumentcontrol and analysis applications executables 372A, of an implementationof server 120. For example, the described embodiments of applications372 may include what may be referred to as the Affymetrixcommand-console software. Embodiments of applications 372 mayadvantageously provide what is referred to as a modular interface forone or more computers or workstations and one or more servers, as wellas one or more instruments. The term “modular” as used herein generallyrefers to elements that may be integrated to and interact with a coreelement in order to provide a flexible, updateable, and customizableplatform. For example, as will be described in greater detail belowapplications 372 may comprise a “core” software element enabled tocommunicate and perform primary functions necessary for any instrumentcontrol and image processing application. Such primary functionality mayinclude communication over various network architectures. In the presentexample, modular software elements, such as for instance plug-in module376, may be interfaced with the core software element to perform morespecific or secondary functions. In particular, the specific orsecondary functions may include functions customizable for particularapplications desired by user 101. Further, modules integrated with thecore software elements are considered to be a single softwareapplication such as applications 372.

In the presently described implementation, applications 372 maycommunicate with and control one or more elements or processes of theone or more servers, one or more workstations, and the one or moreinstruments. Also, embodiments of server 120 or computer 150 with animplementation of applications 372 stored thereon could be locatedlocally or remotely and communicate with one or more additional serversand/or one or more other computers/workstations or instruments.

In some embodiments, applications 372 may also be enabled to encryptdata such as one or more data files that will be described in greaterdetail below, where the encrypted data may then be distributed overnetwork 125 to one or more other computers or servers. For example, someembodiments of probe array 140 may be employed for diagnostic purposeswhere the data may be associated with a patient and a diagnosis of adisease or medical condition. It is desirable in many applications toprotect the data using encryption for confidentiality of patientinformation. In addition, one-way encryption technologies may beemployed in situations where access should be limited to only selectedparties such as a patient and their physician. In the present example,only the selected parties have the key to decrypt or associate the datawith the patient. In some applications, the one-way encrypted data maybe stored in one or more public databases or repositories where even thecurator of the database or repository would be unable to associate thedata with the user. The described encryption functionality may also haveutility in clinical trial applications where it may be desirable toisolate one or more data elements from each other for the purpose ofconfidentiality and/or removal of experimental biases.

Applications 372 may, in the present implementation, provide one or moreinteractive graphical user interfaces that allows user 101 to makeselections based upon information presented in an embodiment of GUI 246.Those of ordinary skill will recognize that embodiments of GUI 246 maybe coded in various language formats such as an HTML, XHTML, XML,javascript, Jscript, or other language known to those of ordinary skillin the art used for the creation of enhancement of “Web Pages” viewableand compatible with internet client 379. As described above with respectto internet client 279, internet client 379 may include various internetbrowsers such as Microsoft Internet Explorer, Netscape Navigator,Mozilla Firefox, Apple Safari, or other browsers known in the art.Applications of GUI's 246 viewable via one or more internet typebrowsers may allow user 101 complete remote access to data, management,and registration functions without any other specialized softwareelements. Applications 372 may provide one or more implementations ofinteractive GUI's 246 that allow user 101 to select from a variety ofoptions including data selection, experiment parameters, calibrationvalues, and probe array information within the access to data,management, and registration functions.

In some embodiments, applications 372 may be capable of running onoperating systems in a non-English format, where applications 372 canaccept input form user 101 in various non-English language formats suchas French, Spanish etc., and output information to user 101 in the sameor other desired language output. For example, applications 372 maypresent information to user 101 in various implementations of GUI 246 ina language output desired by user 101, and similarly receive input fromuser 101 in the desired language. In the present example, applications372 is internationalized such that it is capable of interpreting theinput from user 101 in the desired language where the input isacceptable input with respect to the functions and capabilities ofapplications 372.

Embodiments of applications 372 also include instrument controlfeatures, where the control functions of individual types or specificinstruments such as scanner 100, autoloader 110, or fluid handlingsystem 115 may be organized as plug-in type modules to applications 372.For example, each plug-in module may be a separate component such asplug-in module 373 and may provide definition of the instrument controlfeatures to applications 372 where each plug-in module 373 isfunctionally integrated with executables 372A when stored in systemmemory 370. In the present example, each instrument may have one or moreassociated embodiments of plug-in module 373 that for instance may bespecific to model of instrument, revision of instrument firmware orscripts, number and/or configuration of instrument embodiment, etc.Further, multiple embodiments of plug-in module 373 for the sameinstrument such as scanner 100 may be stored in system memory 370 foruse by applications 372, where user 101 may select the desiredembodiment of module 373 to employ, or alternatively such a selection ofmodule 373 may be defined by data encoded directly in a machine readableidentifier as described below or indirectly via the array file, libraryfiles, experiments files and so on.

The instrument control features may include the control of one or moreelements of one or more instruments that could, for instance, includeelements of a hybridization device, fluid handling system 115,autoloader 110, and scanner 100. The instrument control features mayalso be capable of receiving information from the one more instrumentsthat could include experiment or instrument status, process steps, orother relevant information. The instrument control features could, forexample, be under the control of or an element of the interface ofapplications 372. In some embodiments, a user may input desired controlcommands and/or receive the instrument control information via one ofGUI's 246. Additional examples of instrument control via a GUI or otherinterface is provided in U.S. patent application Ser. No. 10/764,663,which is hereby incorporated by reference herein in its entirety for allpurposes.

In some embodiments, applications 372 may employ what may referred to asan “array file”, represented in FIG. 4 as array file 407 that comprisesdata employed for various processing functions of images by applications372 as well as other relevant information. Generally it is desirable toconsolidate elements of data or metadata related to an embodiment ofprobe array 140, experiment, user, or some combination thereof, to asingle file that is not duplicated (i.e. as embodiments of .dat file 415may be in certain applications), where duplication may sometimes be asource of error. The term “metadata” as used herein generally refers todata about data. It may also be desirable in some embodiments torestrict or prohibit the ability to overwrite data in array file 407.Preferentially, new information may be appended to the file providingthe benefit of traceability, and data integrity (i.e. as may be requiredby some regulatory agencies). For example, array file 407 may beassociated with one or more implementations of an embodiment of probearray 140, where array file 407 acts to unify data across a set of probearrays 140. Array file 407 may be created by applications 372 via aregistration process, where user 101 inputs data into applications 372via one or more of GUI's 246. In the present example, array file 407 maybe associated with a custom identifier such as a machine readableidentifier that could include identifiers described in greater detailbelow. Alternatively, applications 372 may create array file 407 andautomatically associate array file 407 with a machine readableidentifier that identifies an embodiment of probe array 140.Applications 372 may employ various data elements for the creation orupdate of array file 407 from one or more library files, such as libraryfiles 274 or other library files, where the information may be providedby a manufacturer of probe array 140 and define characteristics such asprobe location and identity; dimension and positional location (i.e.with respect to some fiducial reference) of the active area of probearray 140; various experimental parameters; instrument controlparameters; or other types of useful information. In addition, arrayfile 407 may also contain one or more metadata elements that couldinclude one or more of a unique identifier for array file 407, humanreadable form of a machine readable identifier, or other metadataelements. In addition, the applications 372 may store data (i.e. asmetadata, or stored data) that includes sample identifiers, array names,user parameters, event logs that may for instance include a valueidentifying the number of times an array has been scanned, relationshiphistories such as for instance the relationship between each .cel fileand the one or more .dat files that were employed to generate the .celfile, and other types of data useful in for processing and datamanagement.

For example, user 101 and/or automated data input devices or programs(not shown) may provide data related to the design or conduct ofexperiments. User 101 may specify an Affymetrix catalogue or custom chiptype (a catalog array such as the Mapping 6.0 Array) either by selectingfrom a predetermined list presented in one or more of GUI's 246 or byscanning a bar code, Radio Frequency Identification (RFID), magneticstrip, or other means of electronic identification related to a chip toread its type. Applications 372 may associate the chip type with variousscanning parameters stored in data tables or library files, such aslibrary files 274 of computer 150, including the area of the chip thatis to be scanned, the location of chrome elements or other features onthe chip used for auto-focusing, the wavelength or intensity/power ofexcitation light to be used in reading the chip, and so on. Also,applications 372 may encode array files 407 in a binary type format thatmay minimize the possibility of data corruption. However, applications372 may be further enabled to export array file 407 in a number ofdifferent formats.

Also, in the same or alternative embodiments, applications 372 maygenerate or access what may be referred to as a “plate” file. The platefile may encode one or more data elements such as pointers to one ormore array files 407, and preferably may include pointers to a pluralityof array files 407.

In some embodiments, raw image data is acquired from scanner 100 andoperated upon by applications 372 to generate intermediate results. Forexample, raw intensity data 405 acquired from scanner 100 may bedirected to .dat file generator 410 and written to data files (*.dat)such as .dat file 415 that comprises an intensity value for each pixelof data acquired from a scan of an embodiment of probe array 140. In thesame or alternative embodiments it may be advantageous to scan sub areas(that may be referred to as sub arrays) of probe array 140 where rawintensity data 405 for each sub area scanned may be written to anindividual embodiment of .dat file 415. Continuing with the presentexample, applications 372 may also include unique identifier assignor460 that encodes a unique identifier for .dat file 415 as well as apointer to an associated embodiment of array file 407 as metadata intoeach .dat file 415 generated. The term “pointer” as used hereingenerally refers to a programming language datatype, variable, or dataobject that references another data object, datatype, variable, etc.using a memory address or identifier of the referenced element in amemory storage device such as in system memory 370. In some embodimentsthe pointers comprise the unique identifiers of the files that are thesubject of the pointing, such as for instance the pointer in .dat file415 described above comprises the unique identifier of array file 407.Additional examples of the generation and image processing of sub arraysis described in U.S. patent application Ser. No. 11/289,975, which ishereby incorporated by reference herein in its entirety for all purpose.

Also, applications 372 may also include .cel file generator 420 that mayproduce one or more .cel files 425 (*.cel) by processing each .dat file415. Alternatively, some embodiments of .cel file generator 420 mayproduce a single .cel file 425 from processing multiple .dat files 415such as with the example of processing multiple sub-arrays describedabove. Similar to .dat file 415 described above each embodiment of .celfile 425 may also include one or more metadata elements. For example,assignor 460 may encode a unique identifier for each .cel file 425 aswell as a pointer to an associated array file 407 and/or the one or more.dat files 415 used to produce the .cel file 425.

Each .cel file 425 contains, for each probe feature scanned by scanner100, a single value representative of the intensities of pixels measuredby scanner 100 for that probe. For example, this value may include ameasure of the abundance of tagged mRNA's present in the target thathybridized to the corresponding probe. Many such mRNA's may be presentin each probe, as a probe on a GENECHIP® probe array may include, forexample, millions of oligonucleotides designed to detect the mRNA's.Alternatively, the value may include a measure related to the sequencecomposition of DNA or other nucleic acid detected by the probes of aGENECHIP® probe array. As described above, applications 372 receivesimage data derived from probe array 140 using scanner 100 and generates.dat file 415 that is then processed to produce .cel intensity file 425,where applications 372 may utilize information from array file 407 inthe image processing function. For instance, .cel file generator 420 mayperform what is referred to as grid placement on the image data in .datfile 415 using data elements such as dimension information to determineand define the positional location of probe features in the image.Typically, .cel file generator 420 associates what may be referred to asa grid with the image data in a .dat file for the purpose of determiningthe positional relationship of probe features in the image with theknown positions and identities of the probe features. The accurateregistration of the grid with the image is important for the accuracy ofthe information in the resulting .cel file 425. Also, some embodimentsof .cel file generator 420 may provide user 101 with a graphicalrepresentation of a grid aligned to image data from a selected .dat filein an implementation of GUI 246, and further enable user 101 to manuallyrefine the position of the grid placement using methods commonlyemployed such as placing a cursor over the grid, selecting such as byholding down a button on a mouse, and dragging the grid to a preferredpositional relationship with the image. Examples of grid registrationand methods of positional refinement are described in U.S. Pat. Nos.6,090,555; 6,611,767; 6,829,376, and U.S. patent application Ser. Nos.10/391,882, and 10/197,369, each of which is hereby incorporated byreference herein in its entirety for all purposes.

As noted, another file that may be generated by applications 372 is .chpfile 435 using .chp file generator 430. For example, each .chp file 435is derived from analysis of .cel file 425 combined in some cases withinformation derived from array file 407, other lab data and/or libraryfiles 274 that specify details regarding the sequences and locations ofprobes and controls. The resulting data stored in .chp file 435 includesdegrees of hybridization, absolute and/or differential (over two or moreexperiments) expression, genotype comparisons, detection ofpolymorphisms and mutations, and other analytical results.

In some alternative embodiments, user 101 may prefer to employ differentapplications to further process or perform higher level/specializedanalysis such as analysis application 380. Various embodiments ofanalysis application 380 may exist such as applications developed by themanufacturer for specialized embodiments of probe array 140, commercialthird party software applications, open source applications, or otherapplications known in the art for specific analysis or high levelanalysis of data from probe arrays 140. Applications 372 may be enabledto export .cel files 425, .dat files 415, or other files to analysisapplication 380 or allow enable access to such files on computer 150 byanalysis application 380. Such functionality may be enabled by one ormodules as described above with respect to plug-in module 373.

Additional examples of .cel and .chp files are described with respect tothe Affymetrix GENECHIP® Operating Software or Affymetrix MicroarraySuite (as described, for example, in U.S. Pat. No. 7,031,846 or U.S.patent application Ser. No. 10/764,663, both of which are herebyincorporated herein by reference in their entireties for all purposes).For convenience, the term “file” often is used herein to refer to datagenerated or used by applications 372 and executable counterparts ofother applications such as analysis application 380, where the data iswritten according a format such as the described .dat, .cel, and .chpformats. Further, the data files may also be used as input forapplications 372 or other software capable of reading the format of thefile.

Some embodiments of applications 372 may be enabled to store and managedata stored in a file format or file based system. For example, a filebased system may provide a high degree of flexibility over Database typestorage formats where the database formats may require knowledge of aparticular data model or organization of data in order to workeffectively. In the present example, file based systems are not bound bysuch formatting constraints, thereby allowing greater flexibility touser 101 and developers of third party software elements. For instance,embodiments of application 380 enabled to process files generated byapplications 372. In the same or alternative examples, user 101 and/orthe third party developers may employ what are referred to as softwaredevelopment kits that enable programmatic access into file formats, orthe structure of applications 372. Therefore, other softwareapplications may integrate with and seamlessly add functionally to orutilize data from applications 372 that provides user 101 with a widerange of application and processing capability. Additional examples ofsoftware development kits associated with software or data related toprobe arrays are described in U.S. Pat. No. 6,954,699, and U.S.application Ser. Nos. 10/764,663 and 11/215,900, each of which is herebyincorporated by reference herein in its entirety for all purposes.

Some embodiments of applications 372 may employ a system of filemanagement that employs a method or data structure that utilizes aunique identifier associated with each file and a system of pointerswithin files that identify relationships between the files. Thepresently described system has advantages over database type methods ofstoring and managing probe array information for a number of reasons.First, a file based system opens the results and data produced by thesoftware platform to use by third party software. Second, the file basedsystem allows users flexibility to organize and store data in a mannerthat is preferred by the users and more amenable to their work flow anddata management. Third, in the presently described file based system,all data related to the experiments, probe arrays, results, etc. isstored in the files. In other words, there are no separate databases ofexperiment information or the like that must be queried to obtain neededdata for processing.

Embodiments of the unique identifier are independent of file names orother commonly used identifiers. One advantage of associating a uniqueidentifier with each file is that it allows for the changing of filenames by user 101, where the unique identifier still allows the file tobe organized in a particular relationship with other files independentof the file name. For example, some management systems employ the nameof a particular file to track and identify the file such that therelationship with a first file to one or more other files is dependentupon the name of the first file. In the present example, name of thefirst file is changed or modified in any way, the relationships to otherthe one or more other files may be lost. Whereas utilizing a uniqueidentifier embedded as metadata within the file may be protected fromoverwriting and thus the integrity of relationships that depend upon theidentifier is more stable.

Methods of generating unique identifiers may be accomplished in avariety of ways and can include a variety of non-random elements such asone or more of time based identifiers, machine or system identifiers,network identifiers, laboratory identifiers, user identifiers,identifiers particular to the experiment or application, or site basedidentifiers. Other elements of a unique identifier may also include oneor more randomly generated identifiers, or other types of random andnon-random identifiers known to those of ordinary skill in the relatedart. Those of ordinary skill in the art will appreciate that a uniqueidentifier may comprise one or more of the elements described above orany combination thereof. For example, applications 372 may employalgorithm that generates unique identifiers comprising a plurality ofelements arranged in a particular order. The elements may includeelements in the following arrangement: Time-NetworkAddress-Random-Random. In the present example, the arrangement ofelements may comprise a string of characters and the time element mayinclude a reference to system time (i.e. computer system such ascomputer 150), Greenwich Mean Time, or other standard time reference andthe random elements may comprise strings of random characters such asnumbers, letters, symbols, or other commonly employed characters.

In the presently described embodiments, the relationship between filesmay be arranged in a variety of ways. In one embodiment, applications372 employs a file management data structure organized in ahierarchical-like format such as for instance a tree-like hierarchicalstructure where a primary file(s) comprises the “root” of the treestructure and subsequent tiers of files represent dependencies of eachfile on the data in the file from the tier or tiers above. Typically,the tiers may be viewed as having a “parent-child” type relationshipwhere each parent file in a respective tier may have one or more childfiles in the tier below such as for instance each .dat file may be theparent to one or more .cel files in the tier below. Advantageously, thedescribed file management structure provides user 101 with completedownstream traceability of files derived from information in the rootfile and tiers above. The present example of a hierarchical structure isused for the purposes of explanation of the nature of relationshipsbetween files and should not be confused with other types of tree-likedata structure known in the art. For example, the .dat file maybeconsidered the root file for all subsequent downstream files where asecond tier comprises one or more .cel files derived from the .dat file,and a third tier may comprise one or more .chp files derived from each.cel file, where a file in each respective tier comprises a pointer tothe child file in the tier below, and all files comprise a reference tothe unique identifier associated with a common array file. In thepresent example, one or more .cel files may be processed from a single.dat file where each .dat file includes a pointer to the uniqueidentifier of the .cel file. Further, one or more .chp files may begenerated from each .cel file where each .chp includes a pointer to theunique identifier of the .cel file from which it was generated, and insome embodiments may also include a pointer to the .dat and/or arrayfile from which the .cel file was generated.

Additionally, embodiments of applications 372 may include file indexer450 that utilizes and maintains a small (i.e. maintains a minimal amountof information) database for the purpose of searching and identifyingfiles or specific data elements of interest. Such a database may includecache database 455 that comprises data that duplicates data computedearlier and/or stored elsewhere. For example, it may be advantageous toprovide cache database 455 for use in searching for files or specificelements contained within the files such as the .dat, .cel, .chp, andarray files. In the present example, cache database 455 comprises themetadata of each file organized in the database according to a preferreddata model. Additional data stored in cache database 455 for each filecould also include memory addresses, current file names, file size,date/time stamps, electronic signatures, or other information that doesnot include probe array data such as raw or processed intensity values.Such a database provides an advantage because the alternative is to openeach of the files until the desired information is obtained. In someembodiments, indexer 450 comprises a search engine to find various filesor specific data elements within the database. Also user 101 may employan implementation of GUI 246 to create search queries for files orspecific data elements where input/output manager 430 may provide GUI246 and direct search queries to indexer 450.

Analysis Application 380: Analysis Application 380 may comprise any of avariety of known or probe array analysis applications, and particularlyanalysis applications specialized for use with embodiments of probearray 140 designed for genotyping applications. Additional examples ofgenotyping analysis applications may be found in U.S. patent applicationSer. Nos. 10/657,481; 10/986,963; and 20050287575; each of which ishereby incorporated by reference herein in it's entirety for allpurposes. Typically, embodiments of applications 380 may be loaded intosystem memory 270 and/or memory storage device 281 through one of inputdevices 240.

Some embodiments of applications 380 include executable code beingstored in system memory 270, illustrated in FIG. 3 as instrument controland analysis applications executables 380A. As illustrated in FIG. 4,Analysis Application Executables 380A may receive one or more files frominput/output manager 430. For example, Analysis Application Executables380A may be capable of specialized analysis of processed data, such asthe data in .cel file 425. In the present example, user 101 may desireto process data associated with a plurality of implementations of probearray 140 and therefore Analysis Application Executables 380A wouldreceive a .cel file 425 for processed from each probe array. In thepresent example, manager 430 forwards the appropriate files in responseto queries or requests from Analysis Application Executables 380A.

Analysis Application Executables 380A may receive each of .cel files 425and analyze the data using one or more algorithms to determine agenotype call for each SNP represented by a probe set (i.e. set of oneor more probes that interrogate the same target), and one or moremeasure of quality or confidence associated with the genotype call.

Analysis Application Executables 380A may in preferred applicationsanalyze all .cel files 425 in parallel, where for instance higherquality results may be obtained using the combination of data elementsfrom each .cel file 425. Initially, Analysis Application Executables380A will “normalize” the intensity data from each of files 425. Theterm “normalize” as used herein generally refers to performing a processof comparing and adjusting intensity values in each .cel file to a samescale or range such that the intensity values from each of the files iscomparable to one another. Analysis Application Executables 380A mayemploy a variety of normalization methods that may include but are notlimited to quantile normalization, or sketch normalization.

In some embodiments, Analysis Application Executables 380A may alsodetermine an initial assignment for each SNP genotype using a variety ofmethods. In some embodiments, Analysis Application Executables 380A mayperform this function in parallel to the normalization described above.For example, Analysis Application Executables 380A may employ what isreferred to as Dynamic Modeling (DM) methods to make the initialassignment of genotype, where the intensity values are fit to models,and the genotype is determined by the best fit of the data for each SNPto a particular genotype model. Additional examples of dynamic modelingalgorithms are described in U.S. patent application Ser. Nos.10/657,481; 10/986,963; and Ser. No. 11/157,768; incorporated byreference above.

Analysis Application Executables 380A then identifies a minimum numberof instances of each of the three genotype calls (i.e. AA, AB, BB) forthe initial assignments and uses these identified instances to estimatethe prior distribution on typical cluster centers andvariance-covariance matrices. Next, Analysis Application Executables380A may process the data associated with each SNP by combining thecluster centers and variances with the data employing what is referredto as a Bayesian method (see “Bayesian Data Analysis,” by Andrew Gelman,John B. Carlin, Hal S. Stem, and Donald B. Rubin, hereby incorporated byreference in its entirety for all purposes, 2nd edition, Boca Raton,Fla.: Chapman & Hall/CRC, c2004) to derive a posterior estimate ofcluster centers and variances. Lastly, Analysis Application Executables380A assigns a genotype and confidence score for each SNP according towhat is referred to as its Mahalanobis distance (distance rescaled bythe variance & covariance) from the three cluster centers.

Analysis Application Executables 380A may return the genotype values toInstrument control and image processing applications 372 for processinginto a file format or alternatively Analysis Application Executables380A may generate a file. Some or all of the SNP results including thegenotype calls and/or confidence values may also be presented to user101 in one or more GUIs 246.

Highly accurate and reliable genotype calling is an essential componentof any high throughput SNP genotyping technology. BRLMM, the commercialmethod for the Mapping 500K product sold by Affymetrix, Santa Clara, iseffective, but requires the presence of mismatched probes (MM) probes onthe nucleic acid array to create “seed” genotypes (seed genotypes areanother term for initial assignments). One embodiment of the presentinvention is a method that only uses perfect-match probes, BRLMM-P. Onedifference between BRLMM-P and BRLMM is that BRLMM-P derives seedgenotypes directly from the clustering properties of the data (asopposed to BRLMM's reliance on initial genotype seeds from a softwarecalled “Dynamic Model” or DM (See U.S. Ser. Nos. 10/657,481; 10/986,963;and 11/157,768 previously incorporated by reference). Other differencesexist, such as using only the most informative dimension for clusteringand some modifications to the exact choices for likelihood function.

As an extension of the RLMM concept, one presently preferred embodiment,BRLMM-P (like BRLMM) performs a multiple chip analysis, enabling thesimultaneous estimation of probe effects and allele signals for eachSNP. See the following references which are incorporated in theirentireties for a disclosure of RLMM concept; Xiaojun Di, et al.,“Dynamic model based algorithms for screening and genotyping over 100KSNPs on oligonucleotide microarrays”. Bioinformatics 200521(9):1958-1963; Nusrat Rabbee and Terence P. Speed, “A genotype callingalgorithm for Afjymetrix SNP arrays” UC Berkeley Statistics Online TechReports, August 2005 and Nusrat Rabbee and Terence P. Speed. “A genotypecalling algorithm for Afjymetrix SNP arrays” Bioinformatics AdvanceAccess published online on November 2,2005.

FIG. 5 presents an overview of the BRLMM-P approach, which is one aspectof the preferred embodiment. The first step is to normalize the probeintensities and estimate allele signal estimates for each SNP in eachexperiment. The allele signal estimates are then transformed to a2-dimensional space in which the underlying genotype clusters are ‘wellbehaved’ in terms of having similar variance for each of the clusters.Since the primary discriminator of genotype is the “contrast” dimension,the “size” dimension is discarded. In the resulting I-dimensional space,for each SNP, we evaluate the posterior likelihood of all plausibledivisions of the observed data into three (or fewer) seed genotypesusing a Gaussian likelihood model combined with prior information. Thehighest likelihood divisions of the data into plausible genotypes areretained, and combined to form a final estimate of seed genotypeassignments. These final seeds are combined with the data to form aposterior distribution summarizing the best current estimate of genotypecluster center and variance for the SNP. Finally, a genotype andconfidence score are assigned for each observation according to therelative distance to the cluster centers.

We now briefly discuss the general means for efficiently implementingthe computations used by BRLMM-P before going into specific details. Inone currently preferred embodiment, we have the goal of assigninggenotypes BB, AB, AA to N data points obtained from chips hybridized tosamples. In this embodiment, this can be accomplished efficiently andaccurately using a technique designed to optimize clustering metrics.One preferred embodiment is called BRLMM-P and the algorithm workflow isshown in FIG. 5. The methods embodied in BRLMM-P can be divided into twotypes: first, the choice of the clustering metrics used to evaluate apotential assignment, and second, the efficient evaluation of suchassignments to find a sufficiently good assignment of genotypes.

A typical set of clustering metrics is to use the log-likelihood of thedata under a Gaussian cluster model. In one embodiment, there are threeclusters corresponding to the three genotypes, each cluster is assumedto be approximately normally distributed with an individual mean andvariance, and the log-likelihood of a given data point assigned to acluster is the usual Gaussian log-likelihood. For such clusteringmetrics, in one embodiment of the method, the task is to find anassignment of datapoints to genotypes so that the log-likelihood ismaximized. However, the naïve approach is computationally infeasible,requiring evaluation of an exponential number of possibilities. TheBRLMM-P method therefore exploits the structure of the problem toefficiently optimize over plausible genotypes.

For example, given N data points, there are 3{circumflex over ( )}Npossible genotypes that can be assigned to those points (BB, AB, or AA).However, there is often a natural ordering of the data (i.e. more Ballele to less B allele intensity) which leads to only O(N²) plausiblelabels for the data points (i.e. BB^(a) AB^(b) AA^(c)), because the AAgenotype is always to the “right” of the AB genotype, which is always tothe “right” of the “BB” genotype (when plotted as a difference between Aalleles and B alleles). This implies it is possible to efficientlyexamine all O(N²) plausible assignments, and evaluate which genotypeassignment fits the data best, in order to genotype samples.

In particular, with the use of running sums, we can evaluate mean andvariance of genotype clusters, and compute log-likelihoods, in O(1) timeper plausible labeling, for O(N{circumflex over ( )}2) overall time.That is, given N+1 numbers (O,z1, . . . ,zn), the method can computetheir running sums (O,z1,z1+z2,z1+z2+z3, . . . ) in O(N) time, andcompute the mean of a set of data zi . . . zj in O(1) time by simplysubtracting the ith running sum from the j+1st running sum, rather thanadding zi, . . . zj. Similarly, the method can compute the varianceusing running sums of squares. Thus, the method can evaluate anylabeling in O(1) time per labeling, provided that the method iscomputing likelihoods depending only on the mean, variance, or otherquantities that can be computed by running sums. This therefore allowsthe overall time cost of the method to be O(N{circumflex over ( )}2).

Unlike EM (iterative expectation-maximization methods), which moves fromcluster assignment to cluster assignment iteratively improving the fit(a universal method), this procedure evaluates >all< plausibleassignments of three genotypes to the data (specifically tuned for thisproblem). This prevents the method from being stuck in a local minimum,or failing to make progress due to a bad initial assignment of trialgenotypes.

The remainder of the discussion below, steps through each of the abovesteps in detail and then presents a detailed assessment of BRLMM-Pperformance.

FIGS. 6-12 show a method of the present invention which calls genotypesby only using the “contrast” values for each data point. This is aone-dimensional clustering problem in contrast space. There is no needfor a Dynamic Model algorithm to provide seed genotypes for thisclustering method, because it tries all plausible assignments of seedgenotypes, and picks the one that makes the observed data most likely.The present method fits the data using Gaussian clusters (one pergenotype) and fits the (transformed) data in 1-D contrast space. SeeFIG. 6. The log-likelihood of the data given the clustering is used todecide which trial genotype is best.

The data is divided into trial genotype assignments as shown in FIG. 7.The trial genotype assignments are read from left to right asBB(2)->AB(1)->AA(0) (Green=BB, Red=AB, and Black=AA). Each genotypecluster has a mean which is the weighted combination of data plus priorknowledge. If there is no observed data for a genotype, then the clusterparameters are inferred using only the prior data. If there is a largequantity of observed data for a genotype, then the effect of the prioron cluster parameters is minimal, and the parameters are mostly obtainedfrom the observed data. In the typical embodiment, the varianceparameter is fitted to all three clusters to be the same value. Thelog-likelihood shows how well the data fits the assignment resultingfrom this division.

FIG. 8 shows two dividing lines for trial genotype seeds:BB(2)->AB(1)->AA(0). When there are fewer than 3 seed genotypes, themissing cluster must be inferred from the prior. See FIG. 9. The methodtries all (n+1)(n+2)/2 possible divisions of the data as trial genotypeassignments, and the fit is evaluated by log likelihood of data as inFIG. 10. Once the log-likelihood is evaluated for each trial division ofthe data, the method infers final genotype cluster centers and variancefrom a weighted combination of the most likely divisions of data. Priorinformation is used to fill in the cluster parameters for genotypes notobserved in the data.

Given the cluster centers and variances, the genotype for each datapoint is called and assigned a confidence based on the most likelycluster membership. This confidence score provides additionalinformation beyond simply making a “call” of the genotype. Confidencescores indicating an uncertain genotype may be discarded by a downstreamuser to obtain only high-quality data. The confidence for the genotypecalls is shown in FIG. 11. In one embodiment, the most probable clustermembership, X, (for three clusters X,Y,Z) is calculated byPr(X)>Pr(Y)>Pr(Z), where Pr(X)+Pr(Y)+Pr(Z)=1. The confidence score iscalculated by Pr(Y)+Pr(Z)=1-Pr(X). A confidence score near zero is ahighly confident call for the genotype being X.

The above discussion has assumed that the clustering metrics are thebasic Gaussian model. However, in practice, these metrics can beimproved. One direction of improvement is to provide strongerassumptions on the structure of the clusters. A common problem withclustering methods is that increasing the numbers of clusters improvesthe likelihood. If there are only two genotypes present, on occasion thelikelihood will be improved by finding three clusters. This may alsooccur due to the data not actually being distributed as the Gaussianmodel requires. See FIG. 12. This problem can be reduced using“hard-shell” restrictions of various types. In one implementation,cluster centers cannot be closer than some minimum value, in another,cluster centers that are too close lead to a penalty to thelog-likelihood for that trial division of genotypes. Such constraintscan provide “hard barriers” forbidding some labelings that wouldsub-divide clusters. Since such constraints (i.e. mean AA—mean AB mustbe larger than some minimum value) can be evaluated from the mean andvariance alone, they do not contribute extra computation aboveO(N{circumflex over ( )}2) time.

There are also techniques designed to accommodate variations in how thedata responds to the analysis and methods can be employed to customizethe analysis in situations where the actual results do not fit what isobserved. For example, the user can place a Bayesian prior on thelikelihood, so that BB genotypes are likely to have high B allele probeintensities, and AB genotypes are likely to have intensitiesapproximately balanced between B allele and A allele probes.Additionally, the labeling is performed with respect to some linearorder—the likelihood evaluation can be done in a multidimensional space(i.e. one can use a log-likelihood for all 2*k probes in a probe set),without disturbing the O(N{circumflex over ( )}2) time. A SNP-specificprior can be placed on the likelihood, and use information previouslyobtained on the distribution of data. Again, since the Bayesian updatesonly use means and variances, they may be evaluated in O(N{circumflexover ( )}2) time using the above running sums techniques.

For additional efficiency, the O(N{circumflex over ( )}2) time can bereduced to O(N) time using binning techniques. For example, the methodcan “bin” points in [4,1], into [−1,−.99), [−0.99,−.98], . . . [0.99,1],and work with the summary statistics of each bin for evaluating thelikelihood. This limits the computational load to O(#bins{circumflexover ( )}2), which can be much smaller than N. (there is always an O(N)step to read in the data). Instead of outputting the “average labeling”,one can do a single maximization step and compute the means andvariances of each genotype and output those to use as a classifier forunknown data.

This preferred embodiment can provide a useful method of generatinggenotype assignments with high call rate and accuracy.

In the preferred embodiment, there are methods designed to customize theanalysis for problems encountered in practice. For example, onepotential problem with observed data is cluster shifting. The clustersfor one data set may be positioned differently than the clusters from aprior set of experiments. This may be incorporated into the model byallowing some “wobble” in the cluster centers. This is easilyincorporated by simply weakening the priors. Even if there are a largenumber of observations for a given genotype in the prior, do notincrease the prior strength above some value (say, 10 observationsequivalent strength).

Another method for customizing the analysis is to use isotonicregression instead of unconstrained Bayesian regression for the clustercenters. Isotonic regression is a general family of techniques designedto produce regression fits that maintain some monotonic property, suchas always increasing or decreasing. The cluster centers should be in theorder BB, AB, AA moving from left to right in contrast (difference)space. Ordinary Bayesian regression can lead to situations in which thisconstraint is not true, given odd data configurations. This may besolved by using a weighted form of isotonic regression on the clustercenters to find the “closest” valid configuration of cluster centers tothe naïve cluster centers. In fact, we may further wish to constrainthat AB>BB+delta and AA>AB+delta so that all cluster centers areseparated by a minimum distance, delta. Let mB,mH,mA be the originalcluster centers, with posterior weight wB,wH,wA (number ofpseudo-observations). We may use the Pool-Adjacent-Violators algorithmto generate xB,xH,xA, new cluster centers satisfying our condition, andmoving the clusters with the least support in the data the most. Letgamma=delta*(wB−wA)/(wB+wH+wA), and xB=mB−gamma+delta, xH=mH−gamma,xA=mA−gamma−delta. Now apply PAV,if(xB>xH){xB=xH=(wB*xB+wH*xH)/(wB+wH)},if(xH>xA){xH=xA=(wH*xH+wA*xA)/(wH+wA), if(xB>xH){xB=xH=xA=(wB*xB+wH*xH+wA*xA)/(wB+wH+wA))). After this step, itmust be the case that xB<xH<xA. Now updatexB=xB+gamma−delta,xH=xH+gamma,xA=xA+gamma+delta. At this stage, nowxB+delta<xH, and xH+delta<xA. Thus, the new cluster centers areseparated by at least delta and arein the correct order.

An additional method to customize the analysis is to allow the variancesto be different from one genotype cluster to another. The parameter“lambda” has been introduced to control the amount of mixing betweencluster centers. The estimate of common variance betweenclusters=(wA*varA+wB*varB+wH*varH)/(wA+wB+wH), can be modified tovarX=(wX*varY*(3−2*lambda)+wY*vary*lambda+wZ*varZ*lambda)/(wX*(3−2*lambda)+wY*lambda+wZ*lambda)for each cluster. Thus, the points in each cluster count more towardsthe estimate of that cluster, without necessarily requiring all clustersto have the same variance.

Another method for customization is to add BIC, the Bayesian InformationCriterion to the cluster evaluation. This is a penalty to the likelihoodfor having observations in only one, two, or three genotype clusters.Each cluster observed penalizes the likelihood by k*log(n), where k is atuning parameter (usually 2, for mean and variance) and n is the numberof data points. This penalizes having more clusters than is justified bythe data.

A further way to customize the analysis is to add a mixture frequencypenalty to the log-likelihood. This is a penalty to the likelihood forhaving observations in clusters with low frequency. We add to thelikelihood the “frequency” of observing data—each cluster has a numberof observations r, and we add to the likelihood r*log(r). Over all threeclusters, this is n*entropy of the distribution. We also penalize thedecisions of calling a data point by adding the frequency to thelikelihood of being in a cluster. This brings the likelihood closer tothe standard mixture model. We can instead of using the observedfrequency also use the prior number of observations as well for thelikelihood (r*log(r+s)) where s is the prior strength.

Another consideration to add to the analysis is to address Standard CopyNumber Variations (copy number variations that occur frequently enoughamongst wild-type humans to require handling as special cases). ChrX isthe classic example of this phenomenon where males have 1 copy (2cluster centers) and females have 2 copies (3 cluster centers). Otherexamples are chrY, with 1 copy in males (2 cluster centers) and O copiesin females (always no calls), and mitochondrial with 1 copy in allindividuals (2 cluster centers). This can be handled by subsetting thesamples based on estimated copy number of the SNP and fitting modelsonly within copy number strata.

We discuss below some specific steps in the default embodiment of theinvention, providing technical details beyond the above generaldiscussion.

Normalization and Allele Summarization-FIG. 5 shows the BRLMM-Palgorithm workflow. For example, the normalization and allelesummarization steps of the BRLMM-P method consist of producing a summaryvalue for each allele of a SNP in each experiment. The “A” allelesummary value increases and decreases with the quantity of the “A”allele in the target genome, and similarly the “B” allele summary valueincreases and decreases with the quantity of the “B” allele in thetarget genome. These summary values are calculated to remove extraneouseffects—chip-chip variation, background, and the relative brightness ofdifferent probes on the array. This section explains the technicaldetails of this summarization process, which is similar to that used onexpression arrays.

For each SNP of interest, the array contains multiple probes designed tohybridize to each allele of the SNP. The intensities of these featurestypically vary together in systematic ways for each genotype of the SNP.We therefore summarize these intensities in a single value for thefeatures corresponding to each allele, the “signal” for that allele.(Note: due to crosshybridization with the alternate allele, this signaldoes not directly correspond to the concentration of the perfectlymatched allele.) The intensities of the probes matched to the “A” alleleare expected to decrease with decreasing quantities of the “A” allele,and similarly for the “B” allele probes. Since these change in oppositedirections, we summarize the probes for each allele as independentsignals. Therefore, for each SNP in each experiment, we obtain twovalues—an “A” signal and a “B” signal, which summarize the probes.

From the field of expression analysis on arrays, we know how tosummarize several probes to a single signal value effectively. We needto account for extraneous effects on the probe intensity that vary fromexperiment to experiment (normalization), account for potentialdifferences in background from chip to chip (background adjustment), andaccount for the systematic differences in feature intensity due to probecomposition (feature effects). For the SNP 5.0 array (Affymetrix, SantaClara, Calif.) the multiple features used to interrogate each allelehave identical probe sequences, but even so we still use an approachthat allows for systematic differences between probes from sources otherthan probe composition. While there are many options available for eachof these effects, we have chosen to use standard solutions from theliterature: quantile normalization at the feature level, no backgroundadjustment, a log-scale transformation for the perfect matchintensities, and a median polish (a robust method of fitting a model) tofit feature effects to the data obtaining a signal. This is exactly thesame methodology that can be applied to summarize an expression arrayand produce a signal for a probe-set.

Quantile normalization is performed as in the literature, see Bolstad,et al., A Comparison of Normalization Methods for High DensityOligonucleotide Array Data Based on Bias and Variance, Bioinformatics19,2, pp 185-193. The intensities on each chip are ranked, and then theaverage intensity across experiments for each rank of intensity issubstituted within each experiment for the given rank. [If R(I) is therank of intensity within a chip, and Q(R) is the average intensity for agiven rank, the quantile normalized intensity within a chip is Q(R(I))].Because the quantile function is slowly varying and smooth, weapproximate the Q(R) function for each chip with a linear interpolationfor processing speed [“sketch” normalization]. This allows us tonormalize millions of data points per chip rapidly with compactsummaries of the data.

One preferred embodiment uses no adjustment for background. Unlikeexpression arrays, the target concentrations are well above backgroundfor the majority of the fragments containing SNPs. For this assay andgenotype clustering algorithm background adjustment was not useful, andtherefore the (normalized) perfect match intensities are used withoutadjustment for background.

To account for systematic differences in relative brightness betweenfeatures, we fit the standard log-scale additive model to the probes foreach allele separately: log(Ii,j)=fi+tj+ci,j, where fi is the effect dueto feature i across experiments, tj is the effect with experiment jresponding to the genotype of the SNP and the relative quantity of thefragment on which it is located (because of cross-hybridization to theother allele it cannot be interpreted as simply the effect due to theconcentration of target for allele A), and £i,j is the multiplicativeerror for the observation. We fit this model using the standard medianpolish procedure for f and t, and for each experiment output the fittedvalue fort as the signal for that allele. For identifiability, werequire sum(f)=0. The output signal value is retransformed to lie on theoriginal linear intensity scale: signal=exp(t).

These stages constitute the normalization and allele summarizationportion of the algorithm. At the end of these steps, we have for eachSNP in each experiment two signal values: one for the “A” allele probeset, and one for the “B” allele probe set. Each SNP therefore has a 2×Nmatrix of values output—2 signals for each of N experiments. This outputmatrix is then used to evaluate each SNP for the genotype present ineach experiment.

Clustering Space Transformation—Once we have signals for the two allelesof the SNP across all experiments, we evaluate distances between aprototype (cluster center) for a given genotype (AA, AB, BB) and theactual data seen in any one experiment. However, raw “signal” value,while very useful for expression analysis, is not perfectly suited forgenotype cluster analysis (FIG. 6a ). We transform each pair of signalsfor each experiment into a space with properties more suitable forevaluating genotypes.

FIG. 13 shows Clustering Space Transformations. Here a simulated SNP(the data is artificial and used for illustration) is taken through thetransformations used in BRLMM-P. The upper left shows summarized alleleintensities, the lower left shows the log-transformed intensities, theupper right shows the transformation to contrast(sinh(K*(AB)/(A+B)/sinh(K)), and size(log(A+B)), and the lower rightshows the assignment of the number of B alleles to each data point for apotential seeding. In all cases, BB points are green, AB are red, AApoints are black with genotypes assigned by the design reference.

The desirable qualities for such a space include approximateindependence of the difference between genotypes and the magnitude ofsignal, and controlling the variation within the various clusters to becomparable. For example, the standard “MvA” or “MA” transformation usedto plot expression analysis could be applied to the two signals,resulting in M=log(SA)−log(SB) and A=(log(SA)+log(SB))/2. This isolatesmost of the difference between genotypes into the M axis, leaving amostly irrelevant “brightness” component in the A axis. The MvAtransformation is useful, but does not allow any fine tuning of clusterproperties. One hazard is that the spread of homozygous clusters (whereone allele is completely absent) can be very large if, after backgroundadjustment, the resulting signal for that allele is near zero. Signalsnear zero can be extremely variable after taking logarithms, and the MvAtransformation inherits this variability.

It is preferable to use a space in which the spread of homozygousclusters can be controlled, even when a signal estimate is near zero,and where the typical variation can be adjusted to be similar betweenheterozygous and homozygous genotype clusters. Let us define two axes:Contrast=(SA−SB)/(SA+SB) and Strength=log(SA+SB). Strength of coursemeasures the overall brightness, which is mostly independent ofgenotype, and Contrast is a quantity that will depend most strongly ongenotype ranging from −1 for the ideal BB genotype to +1 for the idealAA genotype. However, while this transformation limits the range of theresulting value, and so limits the variation, there is no guarantee thatthe result of this transformation will have similar variation betweenthe heterozygous cluster and the homozygous clusters. We furthergeneralize the Contrast axis to define a Transformed Contrast=asinh(K(SA−SB)/(SA+SB))/a sinh(K), where K is a tuning constant. FIG. 7shows the functional form of this transformation for different values ofK. The effect of varying K is to change the amount of “stretch” of thedifference between A and B signals when the difference is small (i.e.likely to be heterozygous), vs. the difference between A and B signalswhen the difference is large (i.e. likely to be homozygous), thus K canbe used to balance the variability in homozygous and heterozygousgenotypes and remove any heterozygous dropout. By experimentation acrossseveral data sets, it was ascertained that the value K=2 worked well tobalance the variation of genotype clusters (FIG. 6d ).

While many other transformations of the data could be used, this spaceworked well for clustering genotypes while avoiding heterozygousdropout. The “Contrast Center Stretch” (CCS) option was implementedwithin was the software, and cluster in this transformed signal space.The largest quantity of information about the genotype is containedwithin the contrast dimension, with minimal information about thegenotype in the “size” dimension. For BRLMM-P, we only retain thecontrast information for each SNP, and cluster in the resultingI-dimensional space.

FIG. 14: Examples of the Cluster Center Stretch (CCS) transformation.The CCS transformation is defined as asinh(K*Contrast)/asinh(K) whereContrast is defined as (SA−SB)/(SA+SB). The effect of the transformationis to stretch contrast values near zero (corresponding to heterozygousgenotypes) and to compress contrast values near −1 and +1 (correspondingto homozygous genotypes). Higher values of K apply a more extremetransformation, setting K to 1 yields effectively an identitytransformation. The value of K can thus be tuned to alter the balancebetween performance on homozygotes and heterozygotes, with higher Kvalues making heterogenous calls more likely.

Calling Genotypes—genotypes are called by a template-matching procedurecomparing the transformed allele signal values observed in an experimentto the typical values (prototype) we expect for each genotype. Thegenotype that is estimated to have the highest probability of havingproduced the data point is reported as the call. The approximateconfidence for that call is the estimated probability that the datapoint belongs to one of the other clusters. This allows the genotypeassignments to be ranked by quality, and hence make the decision not tocall in cases of ambiguity.

Every autosomal SNP is expected to have three genotypes, “AA”, “AB”, and“BB” For each genotype for a given SNP, it is expected to have aprototype (typical observed values for that genotype, or clustercenter), with some scatter of values around the prototype. In apreferred embodiment the scatter is approximated by a normaldistribution (and the careful choice of the CCS transformation ensuresthis is a good approximation). For clusters of this type, the relativeprobability of belonging to a cluster is computed as a function of thedistance from the cluster center and the variation within the cluster.The standard settings for BRLMM-P (which may be altered by advancedusers) compute a common variance for all clusters.

Within any experiment, it is preferred to derive transformed contrastvalues x for a SNP and compare to the SNP-specific prior on clustercharacteristics, the derivation of which is outlined in the followingsection. The SNP-specific prior includes three cluster centers μAA, μAB,and μBB with covariance matrices ΣAA, ΣAB and ΣBB, from which we obtainrelative probabilities p(AA), p(AB), and p(BB). Note that we retain,where possible, similar notation to that used in the description of theprior BRLMM algorithm, though for BRLMM-P the covariance matrices arejust scalar values since the clustering is performed in aone-dimensional space). We call the genotype of the SNP as the genotypewith the highest probability, X, where P(X)>P(Y)>P(Z).

The confidence we assign to this call is P(Y)+P(Z), where P(X) is theestimated probability for the called cluster. This confidence is alwaysbetween zero and 1 (in fact, it is difficult for it to be above 0.66).It is a rough measure of the quality of the call (but is not a“p-value”). We set a threshold for quality of 0.05 for a call/no-calldecision, based on the performance on several test data sets. This canbe adjusted by the user to tune the tradeoff between call rate andaccuracy—see the results section for a comparison of performance atvarious thresholds.

Estimating Cluster Centers and Variances—The above section dealt withhow to call genotypes and ascribe confidence values to those calls givenan appropriate prototype. This section deals with how to derive theseprototypes. This is achieved using a Bayesian procedure, in which wevisit every SNP and combine a prior (estimate before seeing theparticular data set, plus uncertainty in that value) for that SNP withthe data observed to obtain a posterior estimate of cluster centers andvariances. The prior may be a generic prior common to all SNPs, or aspecific prior computed for that SNP from a set of training data. Aprior has entries for the expected center of each genotype, the expectedvariance of each genotype, the uncertainty in those estimates (measuredin ‘pseudo-observations’), and covariances between those genotypecenters. The posterior estimate has the same structure (mean, variance,uncertainty, and covariances). The posterior estimate is what is thenused to call genotypes.

We do not know the actual division of the data into genotypes whencombining the prior and the data. BRLMM solved this problem by using anexternal source of seed genotypes (DM) giving reliable data for a subsetof data points. BRLMM-P solves this problem by evaluating all plausibleassignments of ‘seed’ genotypes to the full set of data points withrespect to their likelihood, and then averaging over the most likelyseeds. That is, we repeatedly make a “hard” (every data point isassigned to exactly one genotype cluster) assignment of data points to“seed” genotypes, and evaluate the likelihood of the data under aGaussian cluster model to evaluate the quality of this ‘hard’ assignmentof genotypes to data points (this is similar to a K-means procedure). Wecombine the most likely ‘hard’ assignments into a “soft” (allowing adata point to be partially assigned to more than one genotype cluster)assignment of seed genotypes, which we treat as a reliable seed. Once wehave a reliable assignment of seed genotypes, we can compute theposterior distribution of the locations of the three clusters.

We observe that plausible “hard” assignments of genotypes to data pointshave the following structure: sweeping from left to right in contrastspace, we will always see some number (possibly zero) of BB genotypes,followed by some number (possibly zero) of AB genotypes, followed bysome number (possibly zero) of AA genotypes. That is, the more copies ofthe B allele we have, the higher the relative intensity of the B probesrelative to the A probes. Given the data, plausible genotypes areassigned as though there were two dividing contrast values(corresponding to vertical lines in contrast/size space) that determinethe transitions between genotypes (BB to AB, and AB to AA).

FIG. 15 is an example division of (simulated) data. Two dividing linesdivide the data into three assigned genotypes, BB in green, AB in red,and AA in black. Within each genotype, we compute a mean and variance,and combine with the prior to obtain a posterior estimate of mean andvariance for each cluster. The log-likelihood of the data is computedgiven these distributions and the hard assignment of seed genotypes toclusters. FIG. 16 shows another example of dividing the data, with alower likelihood. While a human eye can clearly see that dense clustersin the data are split, the computer must evaluate the likelihood of thedata given this clustering to find that this is a suboptimal assignmentof genotypes to data. FIG. 17 shows a division of the data that includesno AA genotypes (black, 0 B alleles). Note that there is still acomputed mean and variance for the AA cluster, despite no seedsdesignated AA being present. This mean and variance is computed via theprior and is based on the prior center, as well as the covariancesbetween cluster centers.

This implies that for N data points, there are only (N+1)*(N+2)/2plausible ways of dividing the data into genotypes. Rather thanfollowing an iterative procedure (such as K-means or EM), we simplyenumerate them all and thereby avoid any problems of being trapped inlocal maxima of the likelihood when looking at the fit to the data. Thisis not particularly time consuming because the use of Gaussianlikelihoods allows us to evaluate each plausible assignment in O(1) timeby means of running sums. (We have employed additional computationalmethods that allow for linear scaling to large numbers of data points.)For BRLMM-P, we use the normal-inverse-gamma prior for the distributionof the mean and variance of each cluster center. (This differs from thesemi-conjugate distribution used in BRLMM) If we denote: K=covariancematrix between cluster centers (expressed in pseudo-observations)S=variance of observations u=prior means; m=observed means;N=diagonal—number of observations in each cluster. Then for theconjugate prior, the variance of observations factors out of theupdating formula for the cluster centers and results in the updateformula: (K{circumflex over ( )}−1+N){circumflex over( )}−1*(K{circumflex over ( )}−1*u+N*m). This is different from BRLMM inthat we compute the shift in means without first constructing aSNP-specific variance matrix. The conjugate prior links the mean andvariance of each cluster more tightly than the semi-conjugate prior usedin BRLMM, and therefore simplifies the computation. This lightens thecomputational load of computing posteriors and is an advantage of usingthe conjugate prior rather than the semi-conjugate prior used in BRLMM.

Interpreting the formula, the cluster centers move to the averagebetween the mean of the data, weighted by the number of observations ofeach genotype, and the prior location, weighted by the effective numberof pseudo-observations provided in the prior. The variance is thencomputed as a weighted average between the observed variation, the priorvariance, and the distance by which the centers have moved from theprior location. This Bayesian update has the sensible property that whenthere is little or no data available the estimate of cluster centerswill be driven mainly by the prior estimate, u, and when there is a lotof data available for a given genotype the estimate will be driven bythe observed means, m.

The complete computation loop looks like the following: for eachplausible assignment of seed genotypes, evaluate the likelihood of theassignment given the posterior likelihood of the clusters. Given thelikelihoods of all assignments, compute a relative probability for eachdata point to be each genotype (i.e. a “soft” assignment obtained by aweighted average over all plausible “hard” assignments). Use thisresulting “soft” assignment to seed the final computation of theposterior distribution of centers and spread for each genotype. Withthese posterior estimates of center and spread for each cluster,genotypes and confidences are then determined as outlined in theprevious section.

Special Cases—The preceding algorithm assumes that the observations foreach SNP are well described by prototypes for each genotype. However,for SNPs on the X chromosome, there are distinct clusters for eachgender due to males having one fewer copy of the X chromosome. This notonly changes the location of the cluster centers for XY individuals, butthe SNPs located on chrX may end up being called as heterozygote. Wetherefore treat the chrX SNPs differently for XX individuals than for XYindividuals. Note that the special treatment of chrX SNPs described hereis only applied to SNPs on chrX in the nono-pseudo-autosomal region, andfor the rest of this section when we talk about chrX it is to beinterpreted as chrX excluding the pseudo-autosomalregion.

We detect the difference between XY and XX individuals by thedistribution of observations in contrast space for all chrX snps. XYindividuals are estimated as those where the distribution of all chrXSNP contrast values within the sample divides into three clusters by EMwith fewer than 10% of chrX SNP values in the middle cluster. Thisdecision rule allows for some frequency of misclassification of chrXSNPs when treating them uniformly, while robustly discriminating malesfrom females in natural populations. The remaining individuals areclassified as XX. For each chrX SNP, we treat XX individuals and XYindividuals as separate data sets.

XX individuals are handled using the standard BRLMM-P methodology forall chrX SNPs, that is, three cluster centers are learned from the dataalong with within-cluster spread and used to classify observations.However, no data from XY individuals is used in this calculation. XYindividuals are handled using a modification of the BRLMM-P methodologyfor all chrX SNPs. Only two cluster centers can be learned from the data(AA and BB), and only the data for the XY individuals are used.Therefore the following modifications are performed First, we onlyevaluate assignments of AA and BB genotypes as a “seed”, and ignore thecomputed AB cluster completely for both likelihood and making genotypecalls from the posterior. Thus, for XY individuals, only “AA” and “BB”genotypes are fit, and for any observed data, “AB” will never be called.

Fitting of XX and XY individuals separately improves the genotypingperformance within each group. Modifying the prior for XY individuals toavoid heterozygous calls improves the genotyping performance for XYindividuals. This is the justification for having a special purposemodification for chrX SNPs within BRLMM-P.

Another special case is that of a SNP with unusual behavior, such as aSNP with probes for the A allele having a different sequence than probesfor the B allele (for example, the A allele probe could be from onestrand and the B allele probe from the other strand). Such a SNP mayhave a very unusual location for the cluster centers when compared tothe typical SNP on the array. This may lead to erroneous assignment ofcluster identity if, for example, the AB cluster is located where the BBcluster is on a typical SNP. With sufficient data to show examples ofall three clusters, such a mis-assignment will usually be corrected,however, for those SNPs with rare minor alleles, this may require alarge number of samples.

To handle exceptional cases such as this and to improve the performanceon more conventional SNPs, we allow for the provision of a SNP-specificprior for each SNP. This takes data (labeled and/or unlabeled) from atraining set and provides information on where the training genotypesare located. This is very similar in effect to clustering the observeddata with the training data, and requires that lab procedures besufficiently similar between training and observed data so that they maybe clustered together.

Pre-screening samples—In the typical workflow it is very useful to havea simple metric that can be computed based on a single experiment todetermine if the experiment is of high enough quality to be consideredas completed and ready for future multiple-sample analysis, or if itshould be repeated. BRLMM-P yields very high quality genotype calls butit is inherently a multiple sample method, and the exact results for anygiven sample will depend in part on the batch in which the sample isanalyzed, so it is not ideally suited for in-lab single-chip qualitydetermination.

With this application in mind, we have taken advantage of the DMgenotype calling algorithm [Xiaojun Di, al., “Dynamic model basedalgorithms for screening and genotyping over 100K SNPs onoligonucleotide microarrays”. See U.S. patent application Ser. Nos.10/657,481; 10/986,963; and Ser. No. 11/157,768. Bioinformatics 200521(9):1958-1963]. It is a single chip analysis method and call rateswith DM are very strongly correlated with call rates and concordancewhen experiments are ultimately re-called with BRLMM-P or othermultiple-sample genotype calling methods. The SNP 5.0 array has a set of3,022 SNPs tiled with both PM and MM probes so that each chip can beanalyzed with DM (at a confidence threshold of 0.33) to produce a callrate. We call this metric the QC call rate. The 3,022 SNPs tiled forcalling with DM are a subset of the 500,568 SNPs on the Mapping500Kproduct (1,511 from each of the Nsp and Sty arrays), see the netAffxAnalysis center on the Affymetrix web site for more information, butthey are not a random sample—the pool is intentionally enriched for SNPsthat were more challenging to call in the Mapping500K to yield a moresensitive metric of quality.

The recommended protocol is that experiments with a QC call rate of 86%or better should be considered as complete and are expected to result ina call rate of 97% or better when recalled with BRLMM-P. Note that thisthreshold of 86% is specific to the SNP 5.0 array and the particular setof 3,022 SNPs tiled on it. QC call rates of 65%, 70%, 75%, 80% or 85%may also be acceptable.

The ideal way to assess performance would be to evaluate the tradeoffbetween accuracy and call rate in data generated from a collection ofsamples for which the true reference genotypes are available for allSNPs on the SNP5.0 or other arrays. Fortunately something closelyapproximating this has been made possible by the International HapMapConsortium—the phase 2 release provides reference calls on a collectionof 270 samples for approximately 70% of the SNPs on array. Ifsubmissions to HapMap by Affymetrix are included this rises to 97% ofthe SNPs, however for the sake of computing concordance we try to avoidoverestimating concordance by including only the non-Affymetrixsubmissions to HapMap. This constitutes an excellent resource for theperformance evaluation; though it is important to bear in mind thecaveat that the genotype calls in HapMap themselves do have some smallbut non-zero error rate. Additionally, the HapMap samples consist ofsome trios, enabling the evaluation of Mendelian inheritance errorrates. Finally, we also look at reproducibility of genotype calls onsample replicates.

For evaluation of call rates, accuracy and Mendelian inheritance errorrate we use four datasets consisting of HapMap samples. The firstdataset consists of all 270 HapMap samples, processed jointly byAffymetrix and the Broad Institute. The remaining three sets use acollection of 44 HapMap samples comprising 30 unique DNAs (10 trios)with five of the samples run multiple times to evaluate reproducibility.

To account for the fact that one can adjust the confidence threshold totrade off between call rate and accuracy we look at performance at allpossible thresholds and plot the relationship between HapMap concordanceand no-call rate, as shown in FIG. 11. Table 1 presents performance forBRLMM-P at its default confidence threshold.

TABLE 1 Performance on HapMap dataset for BRLMM-P at various fixedthresholds. Results are based on 440,794 SNPs. Confidence Overall HomHet Overall Hom Het Method Threshold Call Rate Call Rate Call RateConcordance Concordance Concordance DM 0.26 94.16% 97.24% 86.32% 99.15%99.39% 98.38% DM 0.33 95.96% 98.24% 90.16% 98.94% 99.27% 97.93% BRLMM0.3 97.40% 97.40% 97.75% 99.40% 99.34% 99.55% BRLMM 0.4 98.27% 98.30%98.48% 99.31% 99.25% 99.47% BRLMM 0.5 98.79% 98.82% 98.93% 99.26% 99.20%99.40% BRLMM 0.6 99.15% 99.18% 99.25% 99.17% 99.11% 99.33%

One caveat about evaluating concordance with HapMap is that to someextent it provides only a lower bound estimate for accuracy, sinceHapMap itself does have a certain error rate. With this in mind, it isuseful to look at additional measures of performance. All four datasetssummarized here contain (father, mother, child) trios of samples whichcan be assessed for Mendelian consistency. The Mendelian consistency isestimated looking only at informative trios (those in which we have acall for all three samples where the parents are not both calledheterozygous), call this number T. If the number of such trios whichexhibit a Mendelian inconsistency is E then the Mendelian consistency isestimated as (T−E)/3T.

The final metric of performance we evaluate is reproducibility on samplereplicates. Arguably this metric is less useful than those above sinceit only reports on the consistency of calls made but not on whether ornot those calls are actually correct. Nevertheless, other things beingequal a reproducible method will generally be preferable to one thatisn't. The first dataset does not contain replicates, for the otherthree datasets the pairwise reproducibility of calls is on average99.9%.

FIG. 18 shows the performance of BRLMM-P on HapMap samples. Concordancewith HapMap and genotype call rate is determined at all possiblethresholds, the plots depict the tradeoff between the two performancemetrics. The default confidence threshold of 0.5 is indicated as a pointon each curve. The red curves present the tradeoff on all genotypescombined, the blue and green curves summarize performance looking only agenotypes that HapMap indicates are heterozygous and homozygousrespectively.

BRLMM-P enables accurate calling of genotypes using only PM probes. Thisallows for more SNPs on an array of a given size. The performance iscomparable to the performance of BRLMM on Mapping 500K arrays. As amultiple chip method it has some extra considerations which need to betaken into account in practice.

One matter to consider is the batch size in which to apply BRLMM-P. Moresamples will generally lead to better performance, however very highperformance could be attained with 44 or even fewer samples. Fewersamples could include 30 or fewer, or even 20 or fewer. BRLMM-P performsslightly better on SNPs for which there are observations of all threegenotypes. As a result the addition of more samples is expected to bemainly of benefit to SNPs of lower minor allele frequencies, which willbe more likely to have only one or two observed genotypes for low numberof samples.

Another observation is that reliability of calls improves as the numberof observations in the genotype cluster increases. Thus the addition ofmore samples will tend to be of most benefit to rare genotypes. Sincethe main benefit is to rarer genotypes, addition of more samples mayappear to provide marginal benefit when one focuses on overallperformance. Another important consideration is the extent to whichdatasets can be combined. More samples should improve performance,particularly for rare genotypes. However, the validity of combination ofdatasets will depend on the degree to which the combined datasets havethe same underlying probe intensity distribution and SNP clusterproperties. A good way to check the appropriateness of combiningdatasets is to inspect SNP cluster centers for each dataset separatelyand to check the degree to which the cluster centers and variances areconsistent both with each other and with the SNP specific priordistributions being supplied to BRLMM-P.

Finally, while BRLMM-P removes the reliance on MM probes, it is only oneof a variety of genotype calling algorithms that either exist already orare in development. Currently the list of alternatives includes CRLMM[Benilton Carvalho, Terence P. Speed, and Rafael A. Irizarry,“Exploration, normalization and genotype calls of high densityoligonucleotide SNP array data” (July 2006). Johns Hopkins University,Dept. of Biostatistics Working Papers. Working Paper 111, and GEL [DanL. Nicolae, Xiaolin Wu, Kazuaki Miyake and Nancy J. Cox “GEL: a novelgenotype calling algorithm using empirical likelihood” Bioinformatics22(16), 1942-1947].

Accounting for probe specific effects results in lower variance onallele signal estimates—This step is retained even in arrays employingmultiple copies of the same probe, even though the probe specificeffects are only minimally different between copies. The distribution ofsummary values across arrays is then used to evaluate the likelygenotypes.

Having described various embodiments and implementations, it should beapparent to those skilled in the relevant art that the foregoing isillustrative only and not limiting, having been presented by way ofexample only. Many other schemes for distributing functions among thevarious functional elements of the illustrated embodiment are possible.The functions of any element may be carried out in various ways inalternative embodiments.

As will be appreciated by those skilled in the relevant art, thepreceding and following descriptions of files generated by applications372 are exemplary only, and the data described, and other data, may beprocessed, combined, arranged, and/or presented in many other ways.Also, those of ordinary skill in the related art will appreciate thatone or more operations of applications 372 may be performed by softwareor firmware associated with various instruments. For example, scanner100 could include a computer that may include a firmware component thatperforms or controls one or more operations associated with scanner 100.

Also, the functions of several elements may, in alternative embodiments,be carried out by fewer, or a single, element. Similarly, in someembodiments, any functional element may perform fewer, or different,operations than those described with respect to the illustratedembodiment. Also, functional elements shown as distinct for purposes ofillustration may be incorporated within other functional elements in aparticular implementation. Also, the sequencing of functions or portionsof functions generally may be altered. Certain functional elements,files, data structures, and so on may be described in the illustratedembodiments as located in system memory of a particular computer. Inother embodiments, however, they may be located on, or distributedacross, computer systems or other platforms that are co-located and/orremote from each other. For example, any one or more of data files ordata structures described as co-located on and “local” to a server orother computer may be located in a computer system or systems remotefrom the server. In addition, it will be understood by those skilled inthe relevant art that control and data flows between and amongfunctional elements and various data structures may vary in many waysfrom the control and data flows described above or in documentsincorporated by reference herein. More particularly, intermediaryfunctional elements may direct control or data flows, and the functionsof various elements may be combined, divided, or otherwise rearranged toallow parallel processing or for other reasons. Also, intermediate datastructures or files may be used and various described data structures orfiles may be combined or otherwise arranged. Numerous other embodiments,and modifications thereof, are contemplated as falling within the scopeof the present invention as defined by appended claims and equivalentsthereto.

What is claimed is:
 1. A method for calling genotypes from a probe array experiment, comprising: contacting a genotyping probe array with a DNA sample to obtain a hybridized sample; obtaining hybridization intensity information from the hybridized sample; normalizing probe intensity data; transforming the data by clustering the data for each genetic loci of interest to obtain transformed data; dividing the transformed data into trial genotypes by generating genotyping calls; calculating the highest log-likelihood of data; and assigning a genotype to the data.
 2. The method of claim 1 wherein the array comprises more than 1 million different sequence probes and wherein each probe is perfectly complementary to an allele of a SNP that has a minor allele frequency of at least 2% in a population.
 3. A method for determining the genotype of a nucleic acid sample at a plurality of SNPs, comprising: (a) hybridizing said nucleic acid sample to an array of allele specific probes to obtain raw probe intensity measurements; (b) normalizing said raw probe intensity measurements to obtain normalized probe intensities; performing an allele summarization to obtain allele signal estimates; (c) transforming said allele signal estimates in clustering space to obtain transformed allele signal estimates; (d) generating seed genotypes from said raw probe intensity measurements to obtain initial genotype estimates; (e) fit prior across multiple SNPs using initial genotype estimates and transformed allele signal estimates to obtain prior for cluster characteristics; and (f) update the cluster characteristics using a Bayesian update to obtain posterior for cluster characteristics; and make genotype calls to obtain the genotype of the nucleic acid sample at a plurality of SNPs.
 4. The method of claim 3 further comprising obtaining confidence values for each of the genotype calls. 